Abstract

This  paper  proposes  a  simple  analytical  model  called  M  time-scale  Markov  Decision  Process 
(MMDP)  for  hierarchically  structured  sequential  decision  making  processes,  where  decisions  in 
each  level  in  the  M-level  hierarchy  are  made  in  M  different  time-scales.  In  this  model,  the  state 
space  and  the  control  space  of  each  level  in  the  hierarchy  are  non-overlapping  with  those  of 
the other levels, respectively, and the hierarchy is structured in a “pyramid” sense such that a
decision made at level m (slower time-scale) and/or the state at that level will affect the evolutionary
decision making process of the lower level m + 1 (faster time-scale) until a new decision is
made at the higher level, but the lower level decisions themselves do not affect the higher level’s
transition dynamics. The performance produced by the lower level’s decisions will affect the
higher  level’s  decisions.  A  hierarchical  objective  function  is  defined  such  that  the  finite-horizon 
value  of  following  a  (nonstationary)  policy  at  the  level  m  +  1  over  a  decision  epoch  of  the  level 
m  plus  an  immediate  reward  at  the  level  m  is  the  single  step  reward  for  the  level  m  decision 
making process. From this we define a “multi-level optimal value function” and derive a “multi-level
optimality equation”. We discuss how to solve MMDPs exactly or approximately and also study
heuristic  on-line  methods  to  solve  MMDPs.  Finally,  we  give  some  example  control  problems  that 
can  be  modeled  as  MMDPs. 
1 Introduction

Hierarchically structured control problems have been studied extensively in many contexts in various
areas over the past years with certain types of models and assumptions. This is because many
large-scale  control  problems  that  arise  in  real  applications  show  multi-dimensionally  interdependent 
organized  behavior  among  subsystems  that  constitute  the  whole  system  as  decision  blocks.  Two 
distinguished  hierarchical  structures  studied  in  the  literature  are  “multi-level  structure”,  where 
decision  making  algorithms  in  different  levels  operate  in  different  time  scales  (see,  e.g.,  [22])  and 
“multi-layer  structure”,  where  algorithms  are  divided  “spatially”  and  operate  at  the  same  time 
scale  (see,  e.g.,  [13]). 

This paper focuses on control problems with a particular multi-level structure: hierarchically
structured sequential decision making processes, where decisions in each level in the hierarchy
are made in different time-scales and the hierarchy is structured in a pyramid (bottom-up organization)
sense. That is, decisions made in the higher level affect the decision making process of the
lower  level  but  the  lower  level  decisions  do  not  affect  the  higher  level  (state  transition)  dynamics 
even  though  the  performance  produced  by  the  lower  level  decisions  will  affect  the  decisions  that 
will  be  made  by  the  higher  level. 

A usual approach to multi-level structured problems is that a slow time-scale subsystem
replaces the details of the fast time-scale dynamics with their “average” behavior and then solves its
own optimization problem (see, e.g., [14] [6] or chapter 11 in [33]). This approach makes sense
especially when the hierarchy in the system is structured in the pyramid sense. This pyramid-like
structure was used from the perspective of “performability” and “dependability” in the models of
Trivedi et al. [23] [15] [25], even though controls are not involved in those models. They proposed
a hierarchical performability and dependability model, where the performance models (fast time-scale
models) are solved to obtain performance measures, termed quasi-steady state performance.
These  measures  are  used  as  reward  rates  which  are  assigned  to  states  of  the  dependability  model 
(slow  time-scale  model).  The  dependability  model  is  then  solved  to  obtain  performability  measures. 
The  lower  level  is  modeled  by  a  continuous-time  Markov  chain  and  the  upper  level  is  modeled  by 
a  Markov  reward  process. 

In this paper, we propose a simple analytical model that generalizes the hierarchical model of
Trivedi et al. by incorporating controls into the model, which we refer to as Multi-time scale Markov
Decision Process (MMDP). The model describes interactions between levels in a hierarchy in the
pyramid sense. Each level is associated with distinct state and action spaces. That is, we assume
that no two levels share any state or control action. The upper level state and/or control induces
the  lower  level  MDP  dynamics  over  a  finite  horizon  of  length  corresponding  to  the  decision  epoch 
of  the  upper  level.  In  other  words,  a  particular  pair  of  the  upper  level  state  and/or  control  will 
determine  the  lower  level’s  state  transition  function  and  reward  function.  Hierarchical  objective 
functions are defined such that the (quasi-steady state) performance measure, that is, the finite horizon
value of following a given lower level policy, obtained from the lower level over the decision epoch
of the upper level will affect the upper level decision making. From this we define a “multi-level
value function” and then derive a “multi-level optimality equation” for infinite horizon discounted
reward  and  average  reward  case,  respectively.  We  study  how  to  compute  the  optimal  multi-level 
value  function  exactly  and  approximately.  We  present  an  approximation  method  suited  for  solving 
MMDPs  and  analyze  its  performance  and  discuss  how  to  apply  some  previously  published  on-line 
solution  schemes  in  the  context  of  MMDPs. 

In addition to the hierarchical and multi-time scale control structure inherent in problems
that arise in many different contexts, our model is motivated in particular by recent
observations made in the networking literature. Network traffic shows fluctuations
on multiple time-scales, that is, scale-invariant burstiness (see, e.g., [41]), and this characteristic of
network traffic has been well studied through “long-range dependent” or “self-similar” models (see,
e.g., [42]). However, several recent works have investigated the effects of such multi-time
scaled behavior through Markovian models that approximate the fluctuations in
the traffic (see, e.g., [34] [37] [24] and references therein). These works are usually interested in calculating
buffer overflow probability distributions and are not concerned with developing analytical multi-time
scaled controls that incorporate given traffic models for such behavior of the network traffic,
even though some non-Markovian model based approaches are available (see, e.g., [38] and [16]). For
example, the slow time-scale (“call-level”) relates to the arrival and departure process of video/voice
calls and the fast time-scale (“packet-level”) relates to the packet arrival process of calls during
their “lifetimes”. These different time-scaled dynamics cause fluctuations in the traffic at different
time-scales and give rise to a multi-time scaled queueing control problem¹, and we believe that
an analytical model is needed to approach this kind of control problem.

This  paper  is  organized  as  follows.  We  present  a  formal  description  of  MMDPs  and  characterize 
optimal  solutions  for  MMDPs  in  Section  2  and  discuss  solution  methodologies  in  Section  3.  We  then 
discuss how related work on hierarchical models relates to our model in Section 4. We give some
representative  example  problems  for  MMDPs  in  Section  5  and  conclude  our  paper  in  Section  6. 

2  Multi-time  Scale  MDP 

We  first  present  the  two  time-scale  MDP  model  for  simplicity.  The  M  time-scale  model  with  M  >  2 
can  be  extended  from  the  two  time-scale  model  without  difficulty  and  we  will  discuss  this  issue  later. 


¹ We will discuss this example in more detail in the example problem section.
2.1 Model description

The upper level (slow time-scale) MDP has finite state space $I$ and finite action space $\Lambda$. At each
(discrete) decision time $n \in \{0, 1, 2, \ldots\}$ and at state $i_n \in I$, an action $\lambda_n \in \Lambda$ is taken and $i_n$ makes
a transition to state $i_{n+1} \in I$ according to the probability $P^u(i_{n+1} \mid i_n, \lambda_n)$. Depending on which action
has been taken at which state in the upper level MDP, the lower level (fast time-scale) MDP over
a one-step slow time-scale period is determined accordingly (what we mean by this will be clearer
below). Every MDP in the lower level shares the same state and action space. We denote the finite
state space and the finite action space by $X$ and $A$, respectively. We assume that $I \cap X = \emptyset$ and
$\Lambda \cap A = \emptyset$, which means that the state and control spaces of the upper level and the lower level are
distinct or non-overlapping. We also assume that every control action is admissible at each state
in each level for simplicity.

We denote time in the fast time-scale as $t \in \{t_0, t_1, t_2, \ldots\}$, with $t_{nT} = n$ for $n = 0, 1, \ldots$, where $T$ is
a fixed finite scale factor between the slow and fast time-scales. We implicitly assume that $t_{nT} = n^+$.
That is, there is an infinitesimal gap between $t_{nT}$ and $n$ such that a fast time-scale decision at time
$t_{nT}$ is made slightly after a slow time-scale decision at time $n$ has been made.

Let the initial state in the lower level MDP be $x \in X$ and the initial state in the upper level MDP
be $i \in I$ ($x_{t_0} = x$ and $i_0 = i$ at $n = 0$). An action $\lambda_0 \in \Lambda$ is taken at $i_0$ and the next upper
level state $i_1$ is determined stochastically by $P^u$. Over the time steps $t_0, t_1, \ldots, t_{T-1}$, the
system follows the lower level MDP evolution. That is, at the state $x$ at $t_0$, an action $a \in A$ is
taken and $x$ makes a transition to the next state $y \in X$, which is the state at time $t_1$, according to the
probability $P^l(y \mid x, a, i, \lambda)$, and a nonnegative and bounded reward $R^l(x, a, i, \lambda)$ is incurred.
This process is repeated at the state $y$ at $t_1$, and so forth until the time $t_{T-1}$. That is, the state
transition function and the reward function in the lower level MDP (over a $T$-epoch) are induced
by the upper level state and decision. At time $n = 1$, an upper level action $\lambda_1$ is taken
at $i_1$ (this triggers a new MDP determination) and, starting with a state $z$ at $t_T$ (determined
stochastically from $P^l(z \mid x_{t_{T-1}}, a_{t_{T-1}}, i_0, \lambda_0)$ for now; we will later consider a distribution over $X$, called the
$\delta$-initialization function, as a method of determining the states $x_{t_{nT}}$ for all $n$), the newly
determined lower level MDP evolves (over the next $T$-epoch). See Figure 1 for a graphical illustration
of the time evolution of this process.
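
The time evolution described above can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not part of the model definition: states and actions are encoded as integers, and the names `sample`, `simulate`, `P_u`, `P_l`, `R_l`, `d_u` and `phi` are hypothetical, introduced here only for illustration.

```python
import random

def sample(dist, rng):
    """Draw an index from a list of probabilities that sums to one."""
    u, acc = rng.random(), 0.0
    for k, p in enumerate(dist):
        acc += p
        if u < acc:
            return k
    return len(dist) - 1  # guard against floating-point round-off

def simulate(P_u, P_l, R_l, d_u, phi, x, i, T, epochs, rng):
    """Simulate the two time-scale process for a number of slow epochs.

    Assumed (hypothetical) interfaces: P_u(i, lam) and P_l(x, a, i, lam)
    return next-state distributions as lists, R_l(x, a, i, lam) is the lower
    level reward, d_u(x, i) is an upper level decision rule, and
    phi(r, x, i, lam) is the lower level policy at fast step r of an epoch.
    """
    history = []
    for n in range(epochs):
        lam = d_u(x, i)                  # slow decision at time n
        rewards = []
        for r in range(T):               # fast steps t_{nT}, ..., t_{(n+1)T-1}
            a = phi(r, x, i, lam)
            rewards.append(R_l(x, a, i, lam))
            x = sample(P_l(x, a, i, lam), rng)  # (i, lam) stay fixed in the epoch
        history.append((n, i, lam, rewards))
        i = sample(P_u(i, lam), rng)     # upper transition ignores lower actions
    return history

# Toy instance with two states and two actions per level (arbitrary numbers).
rng = random.Random(0)
history = simulate(
    P_u=lambda i, lam: [0.9, 0.1] if lam == 0 else [0.2, 0.8],
    P_l=lambda x, a, i, lam: [0.5, 0.5] if a == 0 else [0.1, 0.9],
    R_l=lambda x, a, i, lam: float(x + i + lam),
    d_u=lambda x, i: (x + i) % 2,
    phi=lambda r, x, i, lam: r % 2,
    x=0, i=0, T=3, epochs=4, rng=rng,
)
```

Each entry of `history` records one slow epoch: the upper level state and action that induced the lower level MDP, and the $T$ lower level rewards collected under it.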

Throughout this paper, we will use the term “decision rule” in connection with the infinite horizon and the term “policy” in connection with the finite horizon. Define a lower level decision rule $d^l = \{\pi^l_n\}$, $n = 0, 1, \ldots$, as a sequence of $T$-horizon nonstationary policies defined such that for all $n$, $\pi^l_n = \{\phi_{t_{nT}}, \ldots, \phi_{t_{(n+1)T-1}}\}$ is a sequence of functions where for all $k \ge 0$, $\phi_{t_k} : X \times I \times \Lambda \to A$. We will say that a lower level decision rule is stationary with respect to the slow time-scale $n$ if $\pi^l_n = \pi^l_{n'}$ for all $n, n'$, and we will restrict ourselves to only this class of decision rules here. We will denote the set of all possible such stationary decision rules with respect to the slow time-scale as $\mathcal{D}^l$, omit the subscript $n$ in $\pi^l_n$ in this case, and use the times $t_0, \ldots, t_{T-1}$ to refer to the sequence of functions of $\pi^l$ if necessary. We denote by $\Pi^l$ the set of all possible such $T$-horizon nonstationary policies $\pi^l$. We will also omit the subscript on $\phi$ if $\pi^l$ is stationary (with respect to the fast time-scale).

Figure 1: Graphical illustration of time evolution in the two time-scale MDPs (label in figure: $P^l$ and $R^l$ determined by $i$ and $\lambda$ at $n = 1$)

Given a lower level decision rule $d^l \in \mathcal{D}^l$ and a nonnegative and bounded immediate reward function $I^u$ defined over $I \times \Lambda$ for the upper level, we define a function $R^u$ such that for all $n \ge 0$, for $x \in X$, $i_n \in I$ and $\lambda_n \in \Lambda$,

$$R^u(x, i_n, \lambda_n, \pi^l) = E^{x}_{i_n, \lambda_n}\left\{ \sum_{t = t_{nT}}^{t_{(n+1)T-1}} \alpha^{\sigma(t)} R^l(x_t, \phi_t(x_t, i_n, \lambda_n), i_n, \lambda_n) \right\} + I^u(i_n, \lambda_n), \quad 0 < \alpha \le 1, \tag{1}$$

where $\sigma(t_{nT+r}) = r$ for all $n$ with $r = 0, 1, \ldots, T-1$, the superscript $x$ on $E$ signifies the initial state, $x_{t_{nT}} = x$, and the subscript $i_n, \lambda_n$ on $E$ signifies that $i_n$ and $\lambda_n$ are fixed for the expectation. We will use this notational convention throughout the paper. The function $R^u$ is simply the $T$-horizon total expected (discounted) reward of following the $T$-horizon nonstationary policy $\pi^l$ given $i_n \in I$ and $\lambda_n \in \Lambda$, starting with state $x \in X$ with the zero terminal reward function², plus an immediate reward of taking the action $\lambda_n$ at the state $i_n$ at the upper level, and is a bounded function.
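
For a fixed lower level policy, the single-step reward $R^u$ can be evaluated by a backward recursion over the $T$ fast steps. The sketch below is illustrative only: it assumes integer-coded states and actions, and the names `upper_reward`, `phi`, `P_l`, `R_l` and `I_u` are hypothetical callables standing in for the policy, the lower level transition and reward functions, and the upper level immediate reward.

```python
def upper_reward(x, i, lam, phi, P_l, R_l, I_u, T, alpha, n_states):
    """Evaluate R^u(x, i, lam, pi^l) for a fixed T-horizon policy phi.

    w[s] holds the expected discounted reward-to-go from fast step r when the
    lower level state is s; the terminal reward is zero, as in the text.
    """
    w = [0.0] * n_states
    for r in reversed(range(T)):
        w_next, w = w, []
        for s in range(n_states):
            a = phi(r, s, i, lam)                      # action chosen at step r
            cont = sum(p * w_next[y]
                       for y, p in enumerate(P_l(s, a, i, lam)))
            w.append(R_l(s, a, i, lam) + alpha * cont)
    return w[x] + I_u(i, lam)                          # add the immediate reward

# Toy check: unit reward at every step, alpha = 0.5, T = 3, immediate reward 2.
value = upper_reward(
    x=0, i=0, lam=1,
    phi=lambda r, s, i, lam: 0,
    P_l=lambda s, a, i, lam: [0.5, 0.5],
    R_l=lambda s, a, i, lam: 1.0,
    I_u=lambda i, lam: 2.0,
    T=3, alpha=0.5, n_states=2,
)
# value is 1 + 0.5 + 0.25 + 2 = 3.75
```

The recursion costs on the order of $T |X|^2$ operations per pair $(i, \lambda)$, which is why evaluating a policy is cheap even when enumerating all policies is not.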

The total expected (discounted) reward achieved by the lower level $T$-horizon nonstationary policy $\pi^l$ with an immediate reward at the upper level acts as a single-step reward for the upper level MDP. Define the upper level stationary decision rule $d^u$ as a function $d^u : X \times I \to \Lambda$, and denote by $\mathcal{D}^u$ the set of all possible such stationary decision rules. Given the initial $x \in X$ and $i \in I$, our goal is to obtain a decision rule pair of $d^l \in \mathcal{D}^l$ and $d^u \in \mathcal{D}^u$ that achieves the following functional value defined over $X \times I$:

$$V^*(x,i) = \max_{d^u \in \mathcal{D}^u} \max_{d^l \in \mathcal{D}^l} E^{x,i}\left\{ \sum_{n=0}^{\infty} \gamma^n R^u(x_{t_{nT}}, i_n, d^u(x_{t_{nT}}, i_n), \pi^l) \right\}, \quad 0 < \gamma < 1,$$

$$= \max_{d^u \in \mathcal{D}^u} \max_{d^l \in \mathcal{D}^l} E^{x,i}\left\{ \sum_{n=0}^{\infty} \gamma^n \left[ E^{x_{t_{nT}}}_{i_n, \lambda_n}\left\{ \sum_{t = t_{nT}}^{t_{(n+1)T-1}} \alpha^{\sigma(t)} R^l(x_t, \phi_t(x_t, i_n, d^u(x_{t_{nT}}, i_n)), i_n, d^u(x_{t_{nT}}, i_n)) \right\} + I^u(i_n, \lambda_n) \right] \right\},$$

where $\lambda_n = d^u(x_{t_{nT}}, i_n)$ and we will refer to $V^*$ as the two-level optimal infinite horizon discounted value function.

² It is our assumption that the initial state for the next epoch in the slow time-scale does not contribute to the reward for the previous epoch. However, a terminal reward can be defined by a function over $X$, in which case we need to add the terminal reward term in $R^u$.

The second functional value defined as our objective function is

$$J^*(x,i) := \max_{d^u \in \mathcal{D}^u} \max_{d^l \in \mathcal{D}^l} \lim_{H \to \infty} \frac{1}{H} E^{x,i}\left\{ \sum_{n=0}^{H-1} R^u(x_{t_{nT}}, i_n, d^u(x_{t_{nT}}, i_n), \pi^l) \right\},$$

where we refer to $J^*$ as the two-level optimal infinite horizon average value function.

We can see from the definition of the upper level decision rule that the decisions made at the upper level depend on the lower level state, which is the initial state for the lower level MDP evolution over the $T$-horizon in the fast time-scale. The initial state $x_{t_{nT}}$, $n = 1, 2, \ldots$, is determined stochastically by following the policy $\pi^l$. We will consider a more general way of determining the initial state in a later subsection to expand the flexibility of our model. We also remark that even though we added the immediate reward function $I^u$ in the definition of $R^u$ to make our model description more natural, the function $R^l$ can “absorb” the function $I^u$ by newly defining the function $R^l$ itself as

$$R^l(x, i, \lambda, \pi^l) \leftarrow R^l(x, i, \lambda, \pi^l) + \frac{1}{T} I^u(i, \lambda) \quad \text{for } \alpha = 1 \text{ and}$$

$$R^l(x, i, \lambda, \pi^l) \leftarrow R^l(x, i, \lambda, \pi^l) + \frac{1 - \alpha}{1 - \alpha^T} I^u(i, \lambda) \quad \text{for } 0 < \alpha < 1.$$
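
As a quick check of the two absorption constants (a worked step added for illustration; the per-step constant $c$ is a symbol introduced only for this check): adding $c\, I^u(i, \lambda)$ to every lower level reward over one $T$-epoch must contribute exactly $I^u(i, \lambda)$ after discounting, which fixes $c$ through a geometric sum.

```latex
\sum_{r=0}^{T-1} \alpha^{r}\, c\, I^u(i,\lambda) = I^u(i,\lambda)
\quad\Longrightarrow\quad
c = \Bigl(\sum_{r=0}^{T-1} \alpha^{r}\Bigr)^{-1}
  = \begin{cases}
      \dfrac{1}{T}, & \alpha = 1,\\[1ex]
      \dfrac{1-\alpha}{1-\alpha^{T}}, & 0 < \alpha < 1.
    \end{cases}
```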

2.2  Optimality  equations 

Because the upper level sequential dynamics are essentially just an MDP with a reward function defined via the lower level MDP dynamics (by fixing a lower level decision rule), the following results hold for MMDPs by a simple adaptation of standard MDP theory (see, e.g., [1] [3] [18] or [30]). Therefore, we omit detailed proofs.

The first theorem yields an optimality equation satisfied by $V^*$. We first define the set of all possible $T$-horizon (lower level) nonstationary policies under a fixed pair of an upper level state and an upper level action. For a given pair of $i \in I$ and $\lambda \in \Lambda$,

$$\Pi^l[i, \lambda] := \left\{ \pi^l[i, \lambda] \;\middle|\; \pi^l[i, \lambda] := \{\phi_{t_0}, \ldots, \phi_{t_{T-1}}\},\ \phi_{t_k} : X \times \{i\} \times \{\lambda\} \to A,\ k = 0, \ldots, T-1 \right\},$$

and let $\mathcal{P}^T_{xy}(\pi^l[i, \lambda])$ denote the probability that state $y \in X$ is reached in $T$ steps starting from $x$ by following the $T$-horizon nonstationary policy $\pi^l[i, \lambda]$. Note that this probability can be obtained from $P^l$.
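
One way to obtain the $T$-step reach probabilities from $P^l$ is to propagate the state distribution forward through the $T$ fast steps. The sketch below assumes integer-coded states and actions; the names `t_step_distribution`, `phi` and `P_l` are hypothetical and serve only as an illustration.

```python
def t_step_distribution(x, i, lam, phi, P_l, T, n_states):
    """Return the list [P^T_{xy} for y in X] under the T-horizon policy phi.

    phi(r, s, i, lam) gives the action at fast step r in state s, and
    P_l(s, a, i, lam) returns the next-state distribution as a list.
    """
    dist = [0.0] * n_states
    dist[x] = 1.0                          # start from state x with certainty
    for r in range(T):
        nxt = [0.0] * n_states
        for s, mass in enumerate(dist):
            if mass == 0.0:
                continue
            a = phi(r, s, i, lam)
            for y, p in enumerate(P_l(s, a, i, lam)):
                nxt[y] += mass * p         # one step of the nonstationary chain
        dist = nxt
    return dist

# Toy check: a deterministic chain that flips between two states.
dist = t_step_distribution(
    x=0, i=0, lam=0,
    phi=lambda r, s, i, lam: 0,
    P_l=lambda s, a, i, lam: [0.0, 1.0] if s == 0 else [1.0, 0.0],
    T=3, n_states=2,
)
# After three flips starting from state 0, all mass sits on state 1.
```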
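Theorem 2.1 For all $x \in X$ and $i \in I$,

$$V^*(x, i) = \max_{\lambda \in \Lambda} \left( \max_{\pi^l[i,\lambda] \in \Pi^l[i,\lambda]} \left\{ R^u(x, i, \lambda, \pi^l[i, \lambda]) + \gamma \sum_{y \in X} \sum_{j \in I} \mathcal{P}^T_{xy}(\pi^l[i, \lambda]) P^u(j \mid i, \lambda) V^*(y, j) \right\} \right)$$

and $V^*$ is the unique solution to the above equation. Furthermore, for each pair of $x$ and $i$, let the arguments that achieve the r.h.s. of this equation be $\lambda^*$ and $\pi^*[i, \lambda^*] = \{\phi^*_{t_k}\}$, set $d^u(x, i) = \lambda^*$ for $d^u$, and set $\pi^l$ such that $\phi_{t_k}(x, i, \lambda^*) = \phi^*_{t_k}(x, i, \lambda^*)$ for $d^l$. The pair of $d^u$ and $d^l$ achieves $V^*$.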


Proof: We define an MDP that operates in the slow time-scale $n$ as follows. The state at time $n$ is a pair of the lower level state and the upper level state, $(x_{t_{nT}}, i_n)$. An action at state $(x_{t_{nT}}, i_n)$ is a composite control of $\lambda_n \in \Lambda$ and $\pi^l[i_n, \lambda_n] \in \Pi^l[i_n, \lambda_n]$ (from our assumption that $t_{nT} = n^+$, $\pi^l[i_n, \lambda_n]$ is taken slightly after $\lambda_n$ is taken). Observe that we can view $\pi^l$ as a one-step action at the slow time-scale. More precisely, the admissible action set for state $(x_{t_{nT}}, i_n)$ is the set

$$\{(\lambda, \pi) \mid \lambda \in \Lambda,\ \pi \in \Pi^l[i_n, \lambda]\}.$$

The transition probability from $(x_{t_{nT}}, i_n)$ to $(x_{t_{(n+1)T}}, i_{n+1})$ is determined directly from $\mathcal{P}^T$ and $P^u$. Then, from standard MDP theory, we can write Bellman’s optimality equation for this MDP and derive an optimal decision rule that achieves the unique optimal value at each state, from which we can conclude our result. ∎


Even though we assumed finite state spaces with finite action spaces, the issue of infinite/finite state/action spaces and bounded/unbounded reward functions can be treated with well-known MDP theory. We refer the reader to [1] for a substantial discussion of this matter. We now state, for the function $J^*$ we defined, a result similar to the well-known fact for the average reward case in MDP theory.


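Theorem 2.2 If there exists a bounded function $\zeta$ defined over $X \times I$ and a constant $g$ such that for all $x \in X$ and $i \in I$,

$$g + \zeta(x, i) = \max_{\lambda \in \Lambda} \left( \max_{\pi^l[i,\lambda] \in \Pi^l[i,\lambda]} \left\{ R^u(x, i, \lambda, \pi^l[i, \lambda]) + \sum_{y \in X} \sum_{j \in I} \mathcal{P}^T_{xy}(\pi^l[i, \lambda]) P^u(j \mid i, \lambda) \zeta(y, j) \right\} \right), \tag{3}$$

then there exists a decision rule pair of $d^u \in \mathcal{D}^u$ and $d^l \in \mathcal{D}^l$ that achieves $J^*(x, i)$, and $g = J^*(x, i)$ for all $x$ and $i$.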


For conditions that make the “if” part of the above theorem hold, refer to [1] or [18] for a substantial discussion in the MDP context. An optimal decision rule pair can be obtained in a way similar to the method stated in Theorem 2.1.
2.3 Initialization function


So far we considered the case where $x_{t_{nT}}$, $n = 1, 2, \ldots$, is determined by the $T$-horizon nonstationary policy. In the model we described before, $x_{t_{nT}}$, $n = 1, 2, \ldots$, is (stochastically) determined by following the policy $\pi^l$ with given $x_{t_0}$. To consider a more general model, we first define an initialization function $\delta$. We then determine $x_{t_{nT}}$, $n = 1, 2, \ldots$, by the function $\delta$. That is, the lower level MDP initial state for each new $T$-period (in the fast time-scale) is initialized by the function $\delta$. This is motivated by the problem specific nature of organizing behavior in a hierarchy.

Here are some examples of $\delta$. As in the previous description of the model, $\delta$ can be a function defined over $X \times I \times \Lambda$ such that for given $x, i, \lambda$, $\delta(x, i, \lambda)$ is a probability distribution over $X$. Given $x \in X$, $i \in I$ and $\lambda \in \Lambda$, we will use the notation $\delta(x, i, \lambda)[y]$ to denote the probability assigned to $y \in X$ by $\delta(x, i, \lambda)$. In the previous model description, $\delta(x, i, \lambda)[y]$ corresponds to $\mathcal{P}^T_{xy}(\pi^l[i, \lambda])$. From now on, we will use the notation $\delta^{\pi^l}$ to explicitly express the dependence on the lower level policy $\pi^l$ if that is the case. Alternatively, $\delta$ can be defined such that the determination of $x_{t_{nT}}$ depends on the state $x_{t_{nT-1}}$. For example, for some $x, y \in X$, $i \in I$, $\lambda \in \Lambda$,

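$$\delta^{\pi^l}(x, i, \lambda)[y] = \sum_{z \in X} \mathcal{P}^{T-1}_{xz}(\pi^l[i, \lambda])\, p(y \mid z),$$

where $\mathcal{P}^{T-1}_{xz}(\pi^l[i, \lambda])$ is the probability that state $z$ is reached in $T-1$ steps starting from $x$ under the policy $\pi^l[i, \lambda]$, and $p(y \mid z)$ denotes a probability of choosing $y$ given $z$.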

In some cases, only the slow time-scale decisions (e.g., a “reset” control) affect the new initial lower level state. In this case, $\delta$ is defined over $X \times \Lambda$ such that $\delta(x, \lambda)$ gives a probability distribution over $X$. The idea of this $\delta$ parallels the transition structure in the Markovian slowscale model given in [20]. Finally, the determination of $x_{t_{nT}}$ can be independent of $x_{t_{nT-1}}$ or $x_{t_{(n-1)T}}$. For example, we can consider the case where the state in the lower level is initialized depending on the upper level current state $i$ and the next state $j$. Then $\delta$ is defined over $I \times I$ such that $\delta(i, j)$ for $i, j \in I$ gives a probability distribution over $X$.

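The $\delta$-initialization function gives much more flexibility to our multi-time scale model. Depending on the problem structure, $\delta$ can be defined accordingly. With the introduction of $\delta$, we simply need to rewrite the $V^*$ equation (and similarly for the $J^*$ case) as

$$V^*(x, i) = \max_{\lambda \in \Lambda} \left( \max_{\pi^l[i,\lambda]} \left\{ R^u(x, i, \lambda, \pi^l[i, \lambda]) + \gamma \sum_{y \in X} \sum_{j \in I} \delta^{\pi^l}(x, i, \lambda)[y] P^u(j \mid i, \lambda) V^*(y, j) \right\} \right), \quad x \in X,\ i \in I,$$

where we used the first example of $\delta$, which depends on $\pi^l$. In particular, if the $\delta$-function is independent of $\pi^l[i, \lambda]$ (we will then say that the $\delta$-function is independent of the lower level policies), then we can write the above equation as

$$V^*(x, i) = \max_{\lambda \in \Lambda} \left\{ \max_{\pi^l[i,\lambda]} \left( R^u(x, i, \lambda, \pi^l[i, \lambda]) \right) + \gamma \sum_{y \in X} \sum_{j \in I} \delta(x, i, \lambda)[y] P^u(j \mid i, \lambda) V^*(y, j) \right\}, \quad x \in X,\ i \in I.$$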
This special case is very interesting because the optimal finite $T$-horizon value at the lower level acts as a single step reward for the upper level (along with the immediate reward). The upper level decision maker in this case determines, at each time, a problem that the lower level decision maker needs to solve, and the lower level decision maker seeks a “local” optimal solution for the $T$-horizon and follows one of the optimal nonstationary policies that achieve the solution. How the upper level decision maker chooses the problem at each time depends on the local optimal performance achieved by the lower level decision maker. In this sense, this has a flavor of the underlying philosophy of the Stackelberg (leader-follower) game (see, e.g., [2]). This is not true in general because the $\delta$-initialization function may depend on the lower level policy $\pi^l[i, \lambda]$, in which case the lower level decision maker needs to choose a policy concerned not only with the local performance of the policy but also with the effects of the policy on future performance.
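
When the $\delta$-function does not depend on the lower level policy, the inner maximization over $T$-horizon policies is a standard finite-horizon problem that backward induction solves for each fixed pair $(i, \lambda)$. The following is a minimal sketch under that assumption, with integer-coded states and actions; the names `optimal_lower_value`, `P_l`, `R_l` and `I_u` are hypothetical.

```python
def optimal_lower_value(i, lam, P_l, R_l, I_u, T, alpha, n_states, n_actions):
    """Backward induction for max over pi^l of R^u(x, i, lam, pi^l), all x.

    Returns (values, policy), where values[x] includes the immediate reward
    I_u(i, lam) and policy[r][s] is a maximizing action at fast step r.
    """
    w = [0.0] * n_states                   # zero terminal reward
    policy = []
    for r in reversed(range(T)):
        q = [[R_l(s, a, i, lam)
              + alpha * sum(p * w[y] for y, p in enumerate(P_l(s, a, i, lam)))
              for a in range(n_actions)]
             for s in range(n_states)]
        # Record a greedy action per state for this step, then back up values.
        policy.insert(0, [max(range(n_actions), key=row.__getitem__)
                          for row in q])
        w = [max(row) for row in q]
    return [v + I_u(i, lam) for v in w], policy

# Toy check: the reward equals the action index, so action 1 is always best.
values, policy = optimal_lower_value(
    i=0, lam=0,
    P_l=lambda s, a, i, lam: [0.5, 0.5],
    R_l=lambda s, a, i, lam: float(a),
    I_u=lambda i, lam: 0.5,
    T=4, alpha=1.0, n_states=2, n_actions=2,
)
# values is [4.5, 4.5]: four unit rewards plus the immediate reward 0.5.
```

The resulting policy is nonstationary in the fast time-scale in general, which matches the role of the local optimal solution followed by the lower level decision maker.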

2.4  Extension  to  more  than  two  time-scales 

In this subsection, we briefly discuss how to extend the two time-scale model to a model with more than two time-scales by illustrating the three time-scale model. We introduce a higher level (referred to as the highest level) decision making process that evolves with time $s$ and affects the dynamics of the MDP in the $n$ time-scale (now denoted as $n \in \{n_0, n_1, \ldots\}$), with $n_{sH} = s^+$, $s = 0, 1, 2, \ldots$, where $H$ is the finite scale factor. We denote the state and the action space of the highest level as $M$ and $\Psi$, respectively. The notation used in this subsection coincides with that of the previous sections to avoid confusion.

The transition structures are now defined, given $x, y \in X$, $i, j \in I$, $m, m' \in M$ and $a \in A$, $\lambda \in \Lambda$, $\psi \in \Psi$, as $P^u(j \mid i, \lambda, m, \psi)$ for the upper level, $P^l(y \mid x, a, i, \lambda, m, \psi)$ for the lower level, and $P^h(m' \mid m, \psi)$ for the highest level. The decision rule for the highest level $d^h$ is defined to be a stationary decision rule $d^h : X \times I \times M \to \Psi$, and the decision rule for the upper level $d^u$ is defined such that $d^u = \{\pi^u_s\}$, $s = 0, 1, \ldots$, is a sequence of $H$-horizon nonstationary policies where for all $s$, $\pi^u_s = \{\lambda_{n_{sH}}, \ldots, \lambda_{n_{(s+1)H-1}}\}$ is a sequence of functions where for all $k \ge 0$, $\lambda_{n_k} : X \times I \times M \times \Psi \to \Lambda$. The decision rule $d^u$ is stationary with respect to the time-scale $s$. Similarly, we define the lower level decision rule $d^l = \{\pi^l_n\}$ where $\pi^l_n = \{\phi_{t_{nT}}, \ldots, \phi_{t_{(n+1)T-1}}\}$ and $\phi_{t_k} : X \times I \times \Lambda \times M \times \Psi \to A$ for all $k \ge 0$. The decision rule $d^l$ is stationary with respect to the time-scale $n$.

The lower level reward function $R^l$ is now a function of $x, a, i, \lambda, m, \psi$ and the upper level single step reward function $R^u$ is now a function of $x, i, \lambda, m, \psi, \pi^l$ and is defined from $R^l$ with an immediate reward function at the upper level. That is, the highest level state and decision affect the upper and lower MDP reward functions. The single step reward function $R^h$ for the highest level is a function of $x, i, m, \pi^u, \pi^l$ and is defined from $R^u$ over the $H$-horizon in the time-scale $n$ and an immediate reward function at the highest level. From these reward functions, we can define the three-level optimal value function and determine the three-level optimality equation.

3  Solving  MMDPs 

The methods of obtaining the optimal decision rule for each level in MMDPs are well established via standard MDP theory. We will pay attention to the $\delta$-initialization function that depends on the lower level policies and is defined over $X \times I \times \Lambda$ such that $\delta^{\pi^l}(x, i, \lambda)$ for $x \in X$, $i \in I$, and $\lambda \in \Lambda$ gives a probability distribution over $X$. The discussion here can be easily extended to other $\delta$-functions.


3.1  Exact  methods 

We first discuss the discounted case and then the average case. Define an operator $\Theta$ such that for a (bounded and measurable) function $V$ defined over $X \times I$,

$$\Theta(V)(x, i) = \max_{\lambda \in \Lambda} \max_{\pi^l[i,\lambda] \in \Pi^l[i,\lambda]} \left\{ R^u(x, i, \lambda, \pi^l[i, \lambda]) + \gamma \sum_{y \in X} \sum_{j \in I} \delta^{\pi^l}(x, i, \lambda)[y] P^u(j \mid i, \lambda) V(y, j) \right\}$$

for all $x$ and $i$. Then $\Theta$ is a $\gamma$-contraction mapping in sup-norm. For any function $V$ defined over $X \times I$, let $\|V\| = \sup_{x,i} |V(x, i)|$. For any two bounded and measurable functions $U$ and $V$ defined over $X \times I$, it is true that

$$\|\Theta(U) - \Theta(V)\| \le \gamma \|U - V\|.$$

This implies that $V^*$ is unique by the well-known fixed point theorem. Furthermore, for any such $V$,

$$\Theta^n(V) \to V^* \quad \text{as } n \to \infty,$$

a method known as value iteration.
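
Value iteration can be sketched in code for the simpler case in which the $\delta$-function is independent of the lower level policies, so that the inner maximization reduces to a precomputed table of optimal $T$-horizon rewards. This is an illustrative sketch under that assumption, not the general operator: the names `value_iteration`, `R_star`, `delta` and `P_u` and their nested-list layout are hypothetical.

```python
def value_iteration(R_star, delta, P_u, gamma, tol=1e-10, max_iter=100000):
    """Iterate V <- max_lam { R_star + gamma * sum delta * P_u * V }.

    Assumed layout: R_star[x][i][lam] is the optimal T-horizon reward,
    delta[x][i][lam][y] the initialization probability of y, and
    P_u[i][lam][j] the upper level transition probability.
    """
    nX, nI, nL = len(R_star), len(R_star[0]), len(R_star[0][0])
    V = [[0.0] * nI for _ in range(nX)]
    for _ in range(max_iter):
        V_new = [[max(R_star[x][i][l]
                      + gamma * sum(delta[x][i][l][y] * P_u[i][l][j] * V[y][j]
                                    for y in range(nX) for j in range(nI))
                      for l in range(nL))
                  for i in range(nI)]
                 for x in range(nX)]
        diff = max(abs(V_new[x][i] - V[x][i])
                   for x in range(nX) for i in range(nI))
        V = V_new
        if diff < tol:                     # sup-norm change is small enough
            break
    return V

# Toy check: one state per level, two upper actions with rewards 1 and 2.
V = value_iteration(
    R_star=[[[1.0, 2.0]]],
    delta=[[[[1.0], [1.0]]]],
    P_u=[[[1.0], [1.0]]],
    gamma=0.5,
)
# The fixed point satisfies V = 2 + 0.5 * V, so V[0][0] is close to 4.
```

Because the operator is a $\gamma$-contraction, the sup-norm change between iterates shrinks geometrically, which is what the stopping rule relies on.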

For the average reward case, we assume that one of the ergodicity conditions on page 56 of [18], appropriately modified, holds. Then average reward value iteration can also be applied [18]. Let $\Phi$ be an operator that maps a function $V$ defined over $X \times I$ to another function defined over $X \times I$ given by

$$\Phi(V)(x, i) = \max_{\lambda \in \Lambda} \max_{\pi^l[i,\lambda] \in \Pi^l[i,\lambda]} \left\{ R^u(x, i, \lambda, \pi^l[i, \lambda]) + \sum_{y \in X} \sum_{j \in I} \delta^{\pi^l}(x, i, \lambda)[y] P^u(j \mid i, \lambda) V(y, j) \right\}$$

for all $x$ and $i$. Then, with an arbitrary (bounded and measurable) function $V$ defined over $X \times I$, for all $x \in X$, $i \in I$,

$$\Phi^n(V)(x, i) - \Phi^{n-1}(V)(x, i) \to g \quad \text{as } n \to \infty$$

and for any fixed state pair $y \in X$ and $j \in I$,

$$\Phi^n(V)(x, i) - \Phi^n(V)(y, j) \to \zeta(x, i) \quad \text{as } n \to \infty,\ x \in X,\ i \in I.$$
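
The two limits can be observed numerically on a small instance. The sketch below again assumes a $\delta$-function independent of the lower level policies and a precomputed table of optimal $T$-horizon rewards; the names `phi_operator`, `average_reward_iteration`, `R_star`, `delta`, `P_u` and the reference pair `ref` are hypothetical, and a fixed iteration count replaces a proper convergence test.

```python
def phi_operator(V, R_star, delta, P_u):
    """One undiscounted backup: max over lam of R_star + sum delta * P_u * V."""
    nX, nI, nL = len(R_star), len(R_star[0]), len(R_star[0][0])
    return [[max(R_star[x][i][l]
                 + sum(delta[x][i][l][y] * P_u[i][l][j] * V[y][j]
                       for y in range(nX) for j in range(nI))
                 for l in range(nL))
             for i in range(nI)]
            for x in range(nX)]

def average_reward_iteration(R_star, delta, P_u, n_iter=200, ref=(0, 0)):
    """Estimate the gain g and the differences relative to the pair `ref`."""
    nX, nI = len(R_star), len(R_star[0])
    V = [[0.0] * nI for _ in range(nX)]
    gain = 0.0
    for _ in range(n_iter):
        V_new = phi_operator(V, R_star, delta, P_u)
        gain = V_new[ref[0]][ref[1]] - V[ref[0]][ref[1]]  # successive difference
        V = V_new
    base = V[ref[0]][ref[1]]
    bias = [[V[x][i] - base for i in range(nI)] for x in range(nX)]
    return gain, bias

# Toy check: one lower state, two upper states with rewards 1 and 3, and
# uniform upper transitions, so the long-run average reward is 2.
gain, bias = average_reward_iteration(
    R_star=[[[1.0], [3.0]]],
    delta=[[[[1.0]], [[1.0]]]],
    P_u=[[[0.5, 0.5]], [[0.5, 0.5]]],
)
```

In this plain form the iterates grow linearly with the number of iterations; practical implementations usually subtract the reference value at every step (relative value iteration) to keep them bounded.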

We can also use “policy iteration” once $R^u$ is determined. See, e.g., [26] for the average reward case and [30] for the discounted case.


3.2  Approximation  methods 

There are numerous approximation algorithms to solve MDPs. We refer the reader to the books by Puterman [30] or by Bertsekas and Tsitsiklis [5] for a substantial discussion. In this section, we analyze the performance of an approximation-based scheme for solving MMDPs.

Our first approximation is on the $\delta$-initialization function. One of the main difficulties in obtaining an optimal decision rule pair is the given initialization function’s possible dependence on the lower level nonstationary policies. Suppose that this is the case and consider a $\delta'$-initialization function that is independent of the lower level policies and approximates the given $\delta^{\pi^l}$-initialization function with respect to a given metric. Then there exists a unique function $\bar{V}^*$ defined over $X \times I$ such that for all $x$ and $i$,

$$\bar{V}^*(x, i) = \max_{\lambda \in \Lambda} \left\{ \max_{\pi^l[i,\lambda] \in \Pi^l[i,\lambda]} \left( R^u(x, i, \lambda, \pi^l[i, \lambda]) \right) + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x, i, \lambda)[y] P^u(j \mid i, \lambda) \bar{V}^*(y, j) \right\}.$$

We can then bound $|V^*(x, i) - \bar{V}^*(x, i)|$ for all $x$ and $i$ by Theorem 4.2 in Müller’s work [27] with a metric called the “integral probability metric” on the difference between $\delta'$ and $\delta^{\pi^l}$. Of course, if the MMDP problem to solve is associated with a lower level policy independent $\delta$-function, we do not need this approximation step.

The second approximation is on the value of $R^*$, defined as

$$R^*(x,i,\lambda) = \max_{\pi^l[i,\lambda] \in \Pi^l[i,\lambda]} \big( R^u(x,i,\lambda,\pi^l[i,\lambda]) \big),$$

and on $V^*(x,i)$. It is often impossible to obtain the true $R^*$ because of the huge state space of the lower level. Even when $|X|$ is moderate, the computation becomes cumbersome if $T$ is relatively large, although in theory we can use "backward induction". Obtaining the true value of the function $V^*$ is also infeasible in many cases for similar reasons. Suppose we approximate $R^*$ by $R$ such that

$$\sup_{x,i,\lambda} |R^*(x,i,\lambda) - R(x,i,\lambda)| \le \kappa$$

and $V^*$ by some bounded and measurable function $U$ defined over $X \times I$ such that

$$\sup_{x,i} |V^*(x,i) - U(x,i)| \le \epsilon.$$

We will discuss an example of such $R$ and of the function $U$ later in this subsection. Now define a stationary (upper level) decision rule $d^u$ such that for all $x \in X$ and $i \in I$,

$$d^u(x,i) = \arg\max_{\lambda \in \Lambda} \Big\{ R(x,i,\lambda) + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,\lambda)[y]\, P^u(j|i,\lambda)\, U(y,j) \Big\}.$$

Our goal is to bound the performance of the decision rule $d^u$ relative to $V^*$. We define the value of following the decision rule $d^u$ given an initialization function $\delta'$ as follows:

$$V(x,i) = E^{\delta'} \Big[ \sum_{n=0}^{\infty} \gamma^n R\big(x_{t_{nT}}, i_n, d^u(x_{t_{nT}}, i_n)\big) \Big], \quad x_{t_0} = x,\ i_0 = i,$$

where, with a slight abuse of notation, $E^{\delta'}$ indicates that $x_{t_{nT}}$, $n = 1, 2, \ldots$, is a random variable denoting the (lower level) state at time $t_{nT}$, determined stochastically by $\delta'$. We now state a performance bound as a theorem.


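Theorem 3.1 If $\sup_{x,i,\lambda} |R^*(x,i,\lambda) - R(x,i,\lambda)| \le \kappa$ and $\sup_{x,i} |V^*(x,i) - U(x,i)| \le \epsilon$, then

$$|V^*(x,i) - V(x,i)| \le \frac{2\gamma\epsilon + \kappa}{1-\gamma} \quad \text{for all } x \in X \text{ and } i \in I.$$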

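Proof: Let $\lambda_U$ be the argument that achieves the r.h.s. of Equation (4), with $\delta^{\pi^l}$ replaced by $\delta'$, for a function $U$. We will use the notation $\Theta'$ for the operator after this replacement. From the contraction mapping property of the $\Theta'$ operator, for all $x \in X$ and $i \in I$,

$$|\Theta'(V^*)(x,i) - \Theta'(U)(x,i)| \le \gamma \cdot \sup_{x,i} |V^*(x,i) - U(x,i)| \le \gamma\epsilon.$$

We show that $|\Theta'(U)(x,i) - V(x,i)| \le \frac{\gamma\epsilon(1+\gamma)+\kappa}{1-\gamma}$ for all $x \in X$ and $i \in I$. It then follows from $\Theta'(V^*) = V^*$ that

$$\begin{aligned}
|V^*(x,i) - V(x,i)| &\le |\Theta'(V^*)(x,i) - \Theta'(U)(x,i)| + |\Theta'(U)(x,i) - V(x,i)| \\
&\le \gamma\epsilon + \frac{\gamma\epsilon(1+\gamma)+\kappa}{1-\gamma} = \frac{2\gamma\epsilon+\kappa}{1-\gamma},
\end{aligned}$$

which gives the desired result.

Now, for all $x \in X$ and $i \in I$,

$$\begin{aligned}
\Theta'(U)(x,i) &= R^*(x,i,\lambda_U) + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,\lambda_U)[y]\, P^u(j|i,\lambda_U)\, U(y,j) && \text{by the definition of } \Theta' \\
&\le R(x,i,\lambda_U) + \kappa + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,\lambda_U)[y]\, P^u(j|i,\lambda_U)\, U(y,j) && \text{by the given assumption} \\
&\le R(x,i,d^u(x,i)) + \kappa + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, U(y,j) && \text{by the definition of } d^u \\
&\le R(x,i,d^u(x,i)) + \kappa + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, [V^*(y,j) + \epsilon] \\
&= R(x,i,d^u(x,i)) + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, V^*(y,j) + \gamma\epsilon + \kappa \\
&\le R(x,i,d^u(x,i)) + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, [\Theta'(U)(y,j) + \gamma\epsilon] + \gamma\epsilon + \kappa \\
&= R(x,i,d^u(x,i)) + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, \Theta'(U)(y,j) + \gamma\epsilon(1+\gamma) + \kappa \\
&\le R(x,i,d^u(x,i)) + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i)) \Big[ R(y,j,d^u(y,j)) + \kappa \\
&\qquad + \gamma \sum_{z \in X} \sum_{k \in I} \delta'(y,j,d^u(y,j))[z]\, P^u(k|j,d^u(y,j))\, U(z,k) \Big] + \gamma\epsilon(1+\gamma) + \kappa \\
&= R(x,i,d^u(x,i)) + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, R(y,j,d^u(y,j)) \\
&\qquad + \gamma^2 \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i)) \Big[ \sum_{z \in X} \sum_{k \in I} \delta'(y,j,d^u(y,j))[z]\, P^u(k|j,d^u(y,j))\, U(z,k) \Big] \\
&\qquad + \gamma\epsilon(1+\gamma) + \kappa(1+\gamma) \\
&\le R(x,i,d^u(x,i)) + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, R(y,j,d^u(y,j)) \\
&\qquad + \gamma^2 \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i)) \Big[ \sum_{z \in X} \sum_{k \in I} \delta'(y,j,d^u(y,j))[z]\, P^u(k|j,d^u(y,j))\, \Theta'(U)(z,k) \Big] \\
&\qquad + \gamma^2\epsilon(1+\gamma) + \gamma\epsilon(1+\gamma) + \kappa(1+\gamma).
\end{aligned}$$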

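Iterating in this way (under the sum sign), we have that for all $l = 0, 1, \ldots$ and $x \in X$, $i \in I$,

$$\begin{aligned}
\Theta'(U)(x,i) \le{} & E^{\delta'} \Big[ \sum_{n=0}^{l} \gamma^n R\big(x_{t_{nT}}, i_n, d^u(x_{t_{nT}}, i_n)\big) \Big] + \gamma^{l+1} E^{\delta'} \big[ \Theta'(U)(x_{t_{(l+1)T}}, i_{l+1}) \big] \\
& + \gamma\epsilon(1+\gamma) + \cdots + \gamma^{l+1}\epsilon(1+\gamma) + \kappa(1 + \gamma + \cdots + \gamma^{l}). \qquad (6)
\end{aligned}$$

Since $\Theta'(U)$ is bounded, the second term on the r.h.s. of Equation (6) converges to zero as $l \to \infty$ and the first term becomes $V(x,i)$. Therefore it follows that $\Theta'(U)(x,i) - V(x,i) \le \frac{\gamma\epsilon(1+\gamma)+\kappa}{1-\gamma}$. This proves the upper bound case.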

For the lower bound case, observe that for all $x \in X$ and $i \in I$,

$$\begin{aligned}
\Theta'(U)(x,i) &= R^*(x,i,\lambda_U) + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,\lambda_U)[y]\, P^u(j|i,\lambda_U)\, U(y,j) \\
&\ge R^*(x,i,d^u(x,i)) + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, U(y,j) && \text{by the definition of } \lambda_U \\
&\ge R(x,i,d^u(x,i)) - \kappa + \gamma \sum_{y \in X} \sum_{j \in I} \delta'(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, [V^*(y,j) - \epsilon].
\end{aligned}$$

By arguments similar to those for the upper bound case, we can then show that $\Theta'(U)(x,i) - V(x,i) \ge -\frac{\gamma\epsilon(1+\gamma)+\kappa}{1-\gamma}$. This concludes our proof. ■
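The bound $(2\gamma\epsilon+\kappa)/(1-\gamma)$ of Theorem 3.1 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the pair $(x,i)$ is flattened into one index, the kernel `trans` stands in for $\delta'(x,i,\lambda)[y]P^u(j|i,\lambda)$, and the model, the perturbations of $R^*$ and $V^*$, and the function name `greedy_rule_gap` are all hypothetical.

```python
import random


def greedy_rule_gap(n_states=4, n_actions=3, gamma=0.9,
                    kappa=0.05, eps=0.1, seed=0):
    """Return (max |V* - V|, bound) for a random flattened upper level model."""
    rng = random.Random(seed)

    def random_dist(n):
        weights = [rng.random() + 1e-3 for _ in range(n)]
        total = sum(weights)
        return [w / total for w in weights]

    # Hypothetical model: trans[s][a] is a distribution over next states.
    trans = [[random_dist(n_states) for _ in range(n_actions)]
             for _ in range(n_states)]
    r_star = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]

    def backup(r, s, a, v):
        return r[s][a] + gamma * sum(p * v[t] for t, p in enumerate(trans[s][a]))

    # V*: fixed point of the optimality operator with the true rewards R*.
    v_star = [0.0] * n_states
    for _ in range(1000):
        v_star = [max(backup(r_star, s, a, v_star) for a in range(n_actions))
                  for s in range(n_states)]

    # Approximations R and U within kappa and eps of R* and V*.
    r_hat = [[r + rng.uniform(-kappa, kappa) for r in row] for row in r_star]
    u = [v + rng.uniform(-eps, eps) for v in v_star]

    # Greedy decision rule d^u with respect to R and U.
    d = [max(range(n_actions), key=lambda a, s=s: backup(r_hat, s, a, u))
         for s in range(n_states)]

    # V: value of following d^u, accumulated with the approximate reward R.
    v = [0.0] * n_states
    for _ in range(1000):
        v = [backup(r_hat, s, d[s], v) for s in range(n_states)]

    gap = max(abs(a - b) for a, b in zip(v_star, v))
    bound = (2 * gamma * eps + kappa) / (1 - gamma)
    return gap, bound
```

Calling `greedy_rule_gap()` with different seeds returns an observed gap together with the theoretical bound; in such small random instances the observed gap is typically far below the bound, which is a worst-case guarantee.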


We remark that a related result can be found in Corollary 1 in [35], which assumes a finite state space and gives only an upper bound. Our analysis takes a totally different approach and can be applied to a Borel state space, even though our proof is written for the countable case. Furthermore, the result gives not only a lower bound but also a tighter bound.

Now we give an example of $R$. From now on, we assume that the lower level reward function $R^l$ is defined such that it absorbs the upper level immediate reward function, as we discussed in subsection 2.1. Our approximation uses a heuristic lower level policy $\pi^l$ that guarantees that the $T$-horizon total expected discounted reward of following $\pi^l$ is within an error bound of the optimal finite-horizon value.

The methodology of the example is the rolling horizon approach [19], where we choose a horizon $h \ll T$, solve for the optimal $h$-horizon total expected discounted reward, and define a (greedy) stationary policy with respect to that value function. We begin by defining the $h$-horizon total expected discounted reward with $h = 1, \ldots, T$ for every given $i \in I$ and $\lambda \in \Lambda$:

$$R^*_h(x,i,\lambda) = \max_{\pi^l[i,\lambda]} E^{\pi^l}_x \Big\{ \sum_{t=0}^{h-1} \alpha^t R^l\big(x_t, \phi^{i,\lambda}_t(x_t,i,\lambda), i, \lambda\big) \Big\}, \quad 0 < \alpha < 1,\ i \in I,\ \lambda \in \Lambda, \qquad (7)$$

where $R^*_0(x,i,\lambda) = 0$ for all $x \in X$. We also let $R^{\pi^l}(x,i,\lambda) = R^u(x,i,\lambda,\pi^l[i,\lambda])$ defined in Equation (1) for every $i$ and $\lambda$ with $0 < \alpha < 1$, and $R_{\max} = \max_{x,a,i,\lambda} R^l(x,a,i,\lambda)$.

Proposition 3.1 For every given $i \in I$ and $\lambda \in \Lambda$ and a selected $h$ in $\{1, \ldots, T\}$, define a lower level stationary policy $\pi^l[i,\lambda]$ as

$$\phi^{i,\lambda}(x,i,\lambda) = \arg\max_{a \in A} \Big( R^l(x,a,i,\lambda) + \alpha \sum_{y \in X} P^l(y|x,a,i,\lambda)\, R^*_{h-1}(y,i,\lambda) \Big) \quad \text{for all } x \in X.$$

Then, for all $x$, $i$, $\lambda$,

$$0 \le R^*(x,i,\lambda) - R^{\pi^l}(x,i,\lambda) \le \frac{R_{\max}\,\alpha^h (1-\alpha^T)}{1-\alpha}.$$

Proof: The lower bound follows from the definition of $R^*$. Fix arbitrary $i \in I$ and $\lambda \in \Lambda$. Define an operator $\Omega$ that maps a (bounded) function $V$ defined over $X$ to another function defined over $X$, given by

$$\Omega(V)(x) = \max_{a \in A} \Big( R^l(x,a,i,\lambda) + \alpha \sum_{y \in X} P^l(y|x,a,i,\lambda)\, V(y) \Big). \qquad (8)$$

It is well-known that $R^*_h = \Omega^h(R^*_0)$ (see, e.g., [3] [18], etc.). By the contraction mapping property of $\Omega$ (with $\|f\| = \sup_x |f(x,i,\lambda)|$),

$$\begin{aligned}
\|R^*_T - R^*_h\| &\le \alpha \|R^*_{T-1} - R^*_{h-1}\| \le \cdots \le \alpha^h \|R^*_{T-h} - R^*_0\| \\
&\le \alpha^h (1 + \alpha + \cdots + \alpha^{T-h-1}) R_{\max} \le \frac{R_{\max}(\alpha^h - \alpha^T)}{1-\alpha}. \qquad (9)
\end{aligned}$$

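Now by the definition of $\pi^l[i,\lambda]$ (the following proof idea is from [19]),

$$\begin{aligned}
R^*_h(x,i,\lambda) &= R^l(x, \phi^{i,\lambda}(x,i,\lambda), i, \lambda) + \alpha \sum_{y} P^l(y|x, \phi^{i,\lambda}(x,i,\lambda), i, \lambda)\, R^*_{h-1}(y,i,\lambda) \\
&\le R^l(x, \phi^{i,\lambda}(x,i,\lambda), i, \lambda) + \alpha \sum_{y} P^l(y|x, \phi^{i,\lambda}(x,i,\lambda), i, \lambda)\, R^*_h(y,i,\lambda).
\end{aligned}$$

Iterating under the sum sign, we have that for all $w = 0, 1, \ldots, T-1$ and for all $x \in X$,

$$R^*_h(x,i,\lambda) \le E^{\pi^l}_x \Big[ \sum_{t=0}^{w} \alpha^t R^l\big(x_t, \phi^{i,\lambda}(x_t,i,\lambda), i, \lambda\big) \Big] + \alpha^{w+1} E^{\pi^l}_x \big[ R^*_h(x_{w+1},i,\lambda) \big]. \qquad (10)$$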


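We let $w = T - 1$. It then follows from inequality (10) that

$$R^*_h(x,i,\lambda) \le R^{\pi^l}(x,i,\lambda) + \frac{R_{\max}\,\alpha^T (1-\alpha^h)}{1-\alpha}.$$

Therefore, we have

$$R^*(x,i,\lambda) - R^{\pi^l}(x,i,\lambda) \le R^*(x,i,\lambda) - R^*_h(x,i,\lambda) + \frac{R_{\max}\,\alpha^T (1-\alpha^h)}{1-\alpha}.$$

Combining the result in Equation (9) with the previous inequality (recall that $R^* = R^*_T$), we finally have that

$$R^*(x,i,\lambda) - R^{\pi^l}(x,i,\lambda) \le \frac{R_{\max}\,\alpha^h (1-\alpha^T)}{1-\alpha}.$$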
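The rolling horizon construction of Proposition 3.1 can be illustrated on a small finite lower level MDP for one fixed pair of upper level state and action. The following sketch is a minimal check under stated assumptions: the randomly generated model, rewards in $[0,1)$, and the name `rolling_horizon_gap` are hypothetical. It computes $R^*_k$ by backward induction, builds the greedy stationary rule with respect to $R^*_{h-1}$, and compares its $T$-horizon value with $R^*_T$.

```python
import random


def rolling_horizon_gap(n_states=5, n_actions=3, alpha=0.9,
                        T=30, h=5, seed=0):
    """Return (min gap, max gap, bound) of R*_T - R^pi over all states."""
    rng = random.Random(seed)

    def random_dist(n):
        weights = [rng.random() + 1e-3 for _ in range(n)]
        total = sum(weights)
        return [w / total for w in weights]

    # Hypothetical lower level model for a fixed (i, lambda).
    P = [[random_dist(n_states) for _ in range(n_actions)]
         for _ in range(n_states)]
    R = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
    r_max = max(max(row) for row in R)

    def q(x, a, v):
        return R[x][a] + alpha * sum(p * v[y] for y, p in enumerate(P[x][a]))

    # Backward induction: values[k] holds the optimal k-horizon reward R*_k.
    values = [[0.0] * n_states]
    for _ in range(T):
        prev = values[-1]
        values.append([max(q(x, a, prev) for a in range(n_actions))
                       for x in range(n_states)])

    # Greedy stationary rule with respect to R*_{h-1}.
    phi = [max(range(n_actions), key=lambda a, x=x: q(x, a, values[h - 1]))
           for x in range(n_states)]

    # T-horizon value of following phi at every step.
    v_pi = [0.0] * n_states
    for _ in range(T):
        v_pi = [q(x, phi[x], v_pi) for x in range(n_states)]

    gaps = [values[T][x] - v_pi[x] for x in range(n_states)]
    bound = r_max * alpha ** h * (1 - alpha ** T) / (1 - alpha)
    return min(gaps), max(gaps), bound
```

The returned minimum gap should be nonnegative and the maximum gap should not exceed the bound; increasing `h` shrinks the bound geometrically in $\alpha^h$ at the price of a longer backward induction.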


For every given $\kappa > 0$, choosing $h$ such that $\kappa \ge \frac{R_{\max}\,\alpha^h (1-\alpha^T)}{1-\alpha}$ gives the rolling horizon size for a desired error bound for $R^*$. We remark that by letting $T \to \infty$, the above result gives precisely the result of Theorem 3.1 in [19]. A similar approach can be taken for the upper level MDP: we can choose a fixed horizon and use it as the rolling horizon. The value function defined by this horizon approximates $V^*$, i.e., it is an example of $U$. If both levels use the rolling horizon approach, we have a two-level approximation, and an error bound for this two-level rolling horizon approach follows easily from the results obtained in this subsection. In practice, getting the true value of $R^*_h$ will also be difficult, even when $h$ is small, because of the curse of dimensionality. One way of coping with a large state space is to use a sampling method to approximate $R^*_h$ probabilistically. See, e.g., [21] for work in this direction and an analysis for the discounted case.

For the average reward case, we consider the case where one of the ergodicity conditions on page 56 of [18] holds. Furthermore, we assume that, as in the first approximation for the discounted case, the given $\delta^{\pi^l}$-initialization function is approximated by a $\delta'$-initialization function that is independent of the lower level policies. That is, there exist a constant $g$ and a function $\zeta$ such that for all $x$ and $i$,

$$g + \zeta(x,i) = \max_{\lambda \in \Lambda} \Big\{ \max_{\pi^l[i,\lambda] \in \Pi^l[i,\lambda]} R^u(x,i,\lambda,\pi^l[i,\lambda]) + \sum_{y \in X} \sum_{j \in I} \delta'(x,i,\lambda)[y]\, P^u(j|i,\lambda)\, \zeta(y,j) \Big\},$$

and the difference between $g$ and the optimal average reward under the original $\delta^{\pi^l}$-initialization function is bounded with respect to the degree of the approximation of $\delta^{\pi^l}$ by $\delta'$.

We focus on the second approximation for the average case. We will denote $R^*_h$ defined in Equation (7) with $\alpha = 1$ as $\bar{R}^*_h$, $R^{\pi^l} = R^u(x,i,\lambda,\pi^l[i,\lambda])$ defined in Equation (1) with $\alpha = 1$ as $\bar{R}^{\pi^l}$, and the operator $\Omega$ in Equation (8) with $\alpha = 1$ as $\bar{\Omega}$. Suppose that we approximate $R^* (= \bar{R}^*_T)$ by $R$ as before such that

$$\sup_{x,i,\lambda} |R^*(x,i,\lambda) - R(x,i,\lambda)| \le \kappa$$

and that $\zeta$ is approximated by some bounded and measurable function $U$ defined over $X \times I$ such that

$$\sup_{x,i} |\zeta(x,i) - U(x,i)| \le \epsilon.$$

Define a stationary (upper level) decision rule $d^u$ such that for all $x \in X$ and $i \in I$,

$$d^u(x,i) = \arg\max_{\lambda \in \Lambda} \Big( R(x,i,\lambda) + \sum_{y \in X} \sum_{j \in I} \delta'(x,i,\lambda)[y]\, P^u(j|i,\lambda)\, U(y,j) \Big).$$

The value of following the decision rule $d^u$ given an initialization function $\delta'$ is defined as follows:

$$J(x,i) = \lim_{H \to \infty} \frac{1}{H} E^{\delta'} \Big\{ \sum_{n=0}^{H-1} R\big(x_{t_{nT}}, i_n, d^u(x_{t_{nT}}, i_n)\big) \Big\}.$$

We  now  state  a  performance  bound  as  a  theorem  below. 


Theorem 3.2 Assume that one of the ergodicity conditions on page 56 of [18] holds. If $\sup_{x,i,\lambda} |R^*(x,i,\lambda) - R(x,i,\lambda)| \le \kappa$ and $\sup_{x,i} |\zeta(x,i) - U(x,i)| \le \epsilon$, then

$$|g - J(x,i)| \le 2\epsilon + \kappa \quad \text{for all } x \in X \text{ and } i \in I.$$

The proof of this theorem can be obtained by adapting the proof of Theorem 3.2 in [10] (using the invariant probability distribution), so we omit it. We now provide a counterpart of Proposition 3.1 for the undiscounted case ($\alpha = 1$) under an ergodicity assumption.

Define $C := \{(x,a) \mid x \in X, a \in A\}$. For every given $i \in I$ and $\lambda \in \Lambda$, we define $R^l(c,i,\lambda) := R^l(x,a,i,\lambda)$ and $P^l(y|c,i,\lambda) := P^l(y|x,a,i,\lambda)$ for all $c = (x,a) \in C$.


Multi-time  Scale  Markov  Decision  Processes 


17 


Assumption 3.1 There exists a positive number $\nu < 1$ such that for every given $i$ and $\lambda$,

$$\sup_{c,c' \in C} \sum_{y \in X} |P^l(y|c,i,\lambda) - P^l(y|c',i,\lambda)| \le 2\nu.$$

We give a performance bound of the rolling horizon policy in terms of the span semi-norm: for a bounded function $V$ defined over $X \times I \times \Lambda$ and fixed $i \in I$ and $\lambda \in \Lambda$ (with abuse of notation), $\mathrm{sp}(V) = \sup_x V(x,i,\lambda) - \inf_x V(x,i,\lambda)$.
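For a finite lower level model, both quantities above are easy to compute. The sketch below is illustrative only: it assumes the kernel for a fixed $(i,\lambda)$ is given as a list of probability rows, one per state-action pair $c$, and the helper names `ergodicity_coefficient` and `span` are hypothetical.

```python
def ergodicity_coefficient(kernel):
    """Smallest nu with sup_{c,c'} sum_y |P(y|c) - P(y|c')| <= 2 * nu.

    kernel: list of probability rows, one row P(.|c) per state-action pair c.
    Assumption 3.1 holds for this kernel when the returned value is below 1.
    """
    worst = 0.0
    for p in kernel:
        for q in kernel:
            worst = max(worst, sum(abs(a - b) for a, b in zip(p, q)))
    return worst / 2.0


def span(values):
    """Span semi-norm sp(V) = sup_x V(x) - inf_x V(x) for a finite list."""
    return max(values) - min(values)
```

For example, for the two rows `[0.5, 0.5]` and `[0.25, 0.75]` the total absolute difference is 0.5, so the coefficient is 0.25.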

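Proposition 3.2 Assume that the ergodicity condition 3.1 holds. For every given $i \in I$ and $\lambda \in \Lambda$ and a selected $h$ in $\{1, \ldots, T\}$, define a lower level stationary policy $\pi^l$ as

$$\phi^{i,\lambda}(x,i,\lambda) = \arg\max_{a \in A} \Big( R^l(x,a,i,\lambda) + \sum_{y \in X} P^l(y|x,a,i,\lambda)\, \bar{R}^*_{h-1}(y,i,\lambda) \Big) \quad \text{for all } x \in X.$$

Then, for all $i$ and $\lambda$,

$$\mathrm{sp}(\bar{R}^*_T - \bar{R}^{\pi^l}) \le T \cdot \frac{2\nu^{h-1} R_{\max}}{1-\nu} + \frac{2(\nu^h - \nu^T) R_{\max}}{(1-\nu)^2}.$$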

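Proof: We begin with a slightly modified version of Theorem 4.8(a) in [18], stated as the lemma below. See the proof there.

Lemma 3.1 Assume that the ergodicity condition 3.1 holds. For every given $i \in I$ and $\lambda \in \Lambda$ and $h = 1, \ldots, T$, there exists a constant $j^*$ such that for all $x \in X$,

(a) $\bar{R}^*_h(x,i,\lambda) - \bar{R}^*_{h-1}(x,i,\lambda) \ge -\frac{\nu^{h-1} R_{\max}}{1-\nu} + j^*$

(b) $\bar{R}^*_h(x,i,\lambda) - \bar{R}^*_{h-1}(x,i,\lambda) \le \frac{\nu^{h-1} R_{\max}}{1-\nu} + j^*$

Fix $i$ and $\lambda$. Let $p_1 = -\frac{\nu^{h-1} R_{\max}}{1-\nu} + j^*$ and $p_2 = \frac{\nu^{h-1} R_{\max}}{1-\nu} + j^*$. With reasoning similar to that in the proof of Proposition 3.1 and with the inequality in Lemma 3.1(a), we can deduce that for all $w = 0, 1, \ldots, T-1$ and for all $x \in X$,

$$\bar{R}^*_h(x,i,\lambda) \le E^{\pi^l}_x \Big[ \sum_{t=0}^{w} R^l\big(x_t, \phi^{i,\lambda}(x_t,i,\lambda), i, \lambda\big) \Big] + E^{\pi^l}_x \big[ \bar{R}^*_h(x_{w+1},i,\lambda) \big] - (w+1) p_1.$$

We let $w = T - 1$. It then follows from the previous inequality that

$$\bar{R}^*_h(x,i,\lambda) \le \bar{R}^{\pi^l}(x,i,\lambda) + E^{\pi^l}_x \big[ \bar{R}^*_h(x_T,i,\lambda) \big] - T p_1.$$

By the same arguments, we have that

$$\bar{R}^*_h(x,i,\lambda) \ge \bar{R}^{\pi^l}(x,i,\lambda) + E^{\pi^l}_x \big[ \bar{R}^*_h(x_T,i,\lambda) \big] - T p_2.$$

Combining the above two inequalities, it follows that

$$\mathrm{sp}(\bar{R}^*_h - \bar{R}^{\pi^l}) \le T(p_2 - p_1) = T \cdot \frac{2\nu^{h-1} R_{\max}}{1-\nu}. \qquad (11)$$

Now, from the span semi-norm contraction property of $\bar{\Omega}$ [18], we have that

$$\mathrm{sp}(\bar{R}^*_T - \bar{R}^*_h) \le \nu\, \mathrm{sp}(\bar{R}^*_{T-1} - \bar{R}^*_{h-1}) \le \cdots \le \nu^h\, \mathrm{sp}(\bar{R}^*_{T-h}). \qquad (12)$$

From Lemma 3.1, we can also deduce that for all $x \in X$,

$$-\frac{R_{\max}(1-\nu^h)}{(1-\nu)^2} + h j^* \le \bar{R}^*_h(x,i,\lambda) \le \frac{R_{\max}(1-\nu^h)}{(1-\nu)^2} + h j^*.$$

Therefore, $\mathrm{sp}(\bar{R}^*_{T-h}) \le \frac{2 R_{\max}(1-\nu^{T-h})}{(1-\nu)^2}$. Combining Equations (11) and (12) with the previous inequality, we have the desired result:

$$\mathrm{sp}(\bar{R}^*_T - \bar{R}^{\pi^l}) \le T \cdot \frac{2\nu^{h-1} R_{\max}}{1-\nu} + \frac{2(\nu^h - \nu^T) R_{\max}}{(1-\nu)^2}. \qquad (13)$$

■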

We remark that the above result also gives a bound on the finite-horizon average reward, obtained by dividing both sides of Equation (13) by the horizon $T$. In particular, the result obtained by letting $T \to \infty$ in this case does not coincide exactly with the result of Theorem 5.1 in [19]: our result is loose by a factor of 2 in terms of the span semi-norm, even though the upper bound part in Theorem 5.1 would be the same. This is because the lower bound in Theorem 5.1 is 0, which incorporates the fact that the infinite-horizon average reward of any stationary decision rule is no bigger than the optimal infinite-horizon average reward; we could not take advantage of this fact in our proof steps.

Suppose we have a lower level policy dependent initialization function and that we know the set of locally optimal lower level policies that solve the lower level MDP problem for given $i \in I$ and $\lambda \in \Lambda$. A lower level decision rule determined from these policies does not necessarily achieve the optimal multi-level value, because it is a locally optimal or greedy choice. However, solving the optimality equation given in Theorem 2.1, for example, is difficult because the size of the set $\Pi^l[i,\lambda]$ is often huge. We should therefore utilize the fact that we know the locally optimal lower level policies. To illustrate this, we study the discounted case only. For this purpose, let $\Pi^*[i,\lambda]$ be the set of $\pi^l[i,\lambda]$'s that solve the lower level MDP problem for given $i \in I$ and $\lambda \in \Lambda$, i.e., that achieve $R^*$. We then define a pair of upper and lower level decision rules, $d^u$ and $d^l$, from the arguments that achieve the following maximum:

$$\max_{\lambda \in \Lambda} \max_{\pi^l[i,\lambda] \in \Pi^*[i,\lambda]} \Big\{ R^u(x,i,\lambda,\pi^l[i,\lambda]) + \gamma \sum_{y \in X} \sum_{j \in I} \delta^{\pi^l}(x,i,\lambda)[y]\, P^u(j|i,\lambda)\, V^*(y,j) \Big\},$$

such that we set $d^u(x,i) = \lambda$ and $d^l = \{\pi^l\}$, where $\lambda$ and $\pi^l[i,\lambda]$ are the arguments that achieve the above maximum. We let the two-level value of following the pair of $d^u$ and $d^l$ be $V(x,i)$. We also denote the optimal pair of upper and lower level decision rules that achieve $V^*(x,i)$ for all $x$ and $i$ by $d^u_*$ and $d^l_*$ (associated with $\pi^l_*$). Our goal is to bound the error between these two value functions.

Let us first impose an ergodicity assumption: there exists a positive number $\mu < 1$ such that for any $x, x'$ and $i, i'$, for any $\lambda, \lambda'$, and for any $\pi, \pi' \in \Pi^l$,

$$\sum_{y \in X} \sum_{j \in I} \big| \delta^{\pi}(x,i,\lambda)[y]\, P^u(j|i,\lambda) - \delta^{\pi'}(x',i',\lambda')[y]\, P^u(j|i',\lambda') \big| \le 2\mu.$$


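Observe that for every $x$ and $i$,

$$\begin{aligned}
& R^u(x,i,d^u_*(x,i),\pi^l_*[i,d^u_*(x,i)]) + \gamma \sum_{y \in X} \sum_{j \in I} \delta^{\pi^l_*}(x,i,d^u_*(x,i))[y]\, P^u(j|i,d^u_*(x,i))\, V^*(y,j) \\
&\le R^u(x,i,d^u_*(x,i),\pi^l[i,d^u_*(x,i)]) + \gamma \sum_{y \in X} \sum_{j \in I} \delta^{\pi^l_*}(x,i,d^u_*(x,i))[y]\, P^u(j|i,d^u_*(x,i))\, V^*(y,j) \\
&\qquad \text{by the definition of } \pi^l[i,d^u_*(x,i)] \\
&\le R^u(x,i,d^u_*(x,i),\pi^l[i,d^u_*(x,i)]) + \gamma \sum_{y \in X} \sum_{j \in I} \delta^{\pi^l}(x,i,d^u_*(x,i))[y]\, P^u(j|i,d^u_*(x,i))\, V^*(y,j) + \gamma\mu \cdot \mathrm{sp}(V^*) \\
&\qquad \text{by incorporating the ergodicity condition (see, e.g., page 60 in [18])} \\
&\le R^u(x,i,d^u(x,i),\pi^l[i,d^u(x,i)]) + \gamma \sum_{y \in X} \sum_{j \in I} \delta^{\pi^l}(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, V^*(y,j) + \gamma\mu \cdot \mathrm{sp}(V^*) \\
&\qquad \text{by the definitions of } d^u(x,i) \text{ and } \pi^l[i,d^u(x,i)].
\end{aligned}$$

Then, it immediately follows that for all $x$ and $i$,

$$\begin{aligned}
V^*(x,i) - V(x,i) &= R^u(x,i,d^u_*(x,i),\pi^l_*[i,d^u_*(x,i)]) - R^u(x,i,d^u(x,i),\pi^l[i,d^u(x,i)]) \\
&\qquad + \gamma \sum_{y \in X} \sum_{j \in I} \Big[ \delta^{\pi^l_*}(x,i,d^u_*(x,i))[y]\, P^u(j|i,d^u_*(x,i))\, V^*(y,j) - \delta^{\pi^l}(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, V(y,j) \Big] \\
&\le \gamma \sum_{y \in X} \sum_{j \in I} \delta^{\pi^l}(x,i,d^u(x,i))[y]\, P^u(j|i,d^u(x,i))\, [V^*(y,j) - V(y,j)] + \gamma\mu \cdot \mathrm{sp}(V^*)
\end{aligned}$$

from the inequality of the previous observation.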

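Now by majorization,

$$\max_{x,i} \big( V^*(x,i) - V(x,i) \big) \le \gamma \cdot \max_{x,i} \big( V^*(x,i) - V(x,i) \big) + \gamma\mu \cdot \mathrm{sp}(V^*),$$

so that $\max_{x,i} (V^*(x,i) - V(x,i)) \le \gamma\mu \cdot \mathrm{sp}(V^*) / (1-\gamma)$.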

Therefore,  for  all  x  and  i  with  0  <  a  <  1, 

(1  -7)(1  -  «) 

Note that we can define $d^u$ and $d^l$ with respect to a bounded value function $U$ that approximates $V^*$ and derive an error bound relative to $V^*$ by using the above result.


3.3  Heuristic  on-line  methods 

The discussion so far has dealt with "off-line" methods for solving MMDPs. Even though various exact and approximate algorithms can be applied to some control problems, doing so often requires analyzing and exploiting structural properties of the problems, which can be very cumbersome in many interesting problems. In this section, we discuss how to apply some previously published on-line (sampling-based) heuristic techniques in the context of MMDPs.

The first example approach, called "(parallel) rollout", is based on the decision rule/policy improvement principle in the "policy iteration" algorithm (see, e.g., [4] [9] [10]). At each decision time, we simulate, or roll out, an available heuristic decision rule on-line via Monte-Carlo simulation, and we use the estimated value of following the heuristic decision rule to create an (approximately) improved decision rule with respect to the heuristic decision rule. A generalization of single decision rule rollout is to roll out multiple heuristic decision rules in parallel and use the maximum estimated value among them to create a new decision rule. This is particularly useful if system trajectories or sample paths can be divided in such a way that a particular heuristic decision rule is near-optimal for particular system trajectories. The parallel rollout method yields a decision rule that dynamically combines the multiple decision rules, automatically adapting to different system trajectories, and improves on the performance of all of the heuristic decision rules. Several works have reported that this (parallel) rollout approach is quite successful. See, e.g., [10] and the references therein.

We briefly discuss how to apply rollout. Suppose we have a heuristic decision rule pair of $d^l$ for the lower level and $d^u$ for the upper level. At each decision time $n$ (in the slow time-scale), we measure the utility of taking each candidate action $\lambda \in \Lambda$ as follows. We take a candidate action (in an imaginary sense) and then, from the next step on, we simulate $d^l$ and $d^u$ over a finite horizon via Monte-Carlo simulation over many random simulated traces, giving the approximate value of following the decision rule pair. The single step reward of taking action $\lambda$, associated with the lower level quasi-steady state performance, is also estimated by simulation by following the decision rule $d^l$. The sum of the estimated single step reward (plus the immediate reward of taking $\lambda$) and the estimated value of following the decision rule pair $d^l$ and $d^u$ gives the utility measure of the candidate action $\lambda$. At each time $n$, we take the action with the highest utility measure. At the fast time-scale, we just follow the decision rule $d^l$.
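The rollout procedure just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the simulator interface `simulate_epoch` (which is assumed to run the $T$ fast time-scale steps of one slow epoch under the lower level rule $d^l$ and to return the sampled epoch reward, including the immediate reward, together with the next state pair), the callable `base_rule` standing for $d^u$, and the discount parameter `gamma` are all hypothetical names introduced for illustration.

```python
import random
from typing import Callable, Dict, Hashable, Sequence, Tuple


def rollout_action(state, actions: Sequence[Hashable],
                   simulate_epoch: Callable, base_rule: Callable,
                   horizon: int, gamma: float, n_traces: int,
                   rng: random.Random) -> Tuple[Hashable, Dict]:
    """Pick the slow time-scale action with the highest estimated utility.

    simulate_epoch(state, action, rng) -> (epoch_reward, next_state)
        hypothetical simulator of one slow epoch under the lower level rule.
    base_rule(state) -> action chosen by the heuristic upper level rule.
    """
    def one_trace(first_action):
        s, a = state, first_action
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            reward, s = simulate_epoch(s, a, rng)
            total += discount * reward
            discount *= gamma
            a = base_rule(s)  # after the candidate action, follow the heuristic
        return total

    # Monte-Carlo estimate of the utility of each candidate action.
    utilities = {a: sum(one_trace(a) for _ in range(n_traces)) / n_traces
                 for a in actions}
    best = max(utilities, key=utilities.get)
    return best, utilities
```

Parallel rollout would evaluate several base rules inside `one_trace` and keep the maximum estimate per candidate action; the single-rule version is shown to keep the sketch short.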

Monte-Carlo simulation is one example of a simulation strategy. Depending on the nature of the problem, we can use other simulation methods, e.g., single-path simulation, TD($\lambda$) (see, e.g., [5]), importance sampling, etc. In particular, single-path simulation for the lower level MDP will be useful if $T$ is relatively large and the underlying Markov chain induced by a decision rule pair is ergodic. In fact, by using Theorem 2 in [17], we can derive a probabilistic bound on the error between the single-path simulation estimate of the value of following a decision rule (or a decision rule pair in our context) and the true value of the decision rule, in terms of the degree of ergodicity and the simulation horizon.

The above (parallel) rollout approach can be referred to as a lower bound approach, as the value of following any decision rule pair is a lower bound on the optimal finite-horizon value. The next example, called "hindsight optimization" [11], is based on an upper bound. Hindsight optimization can be viewed as a heuristic method of adapting (deterministic) optimal sample-path based solutions into an on-line solution. Instead of evaluating a decision rule pair by simulation as in rollout, for each simulated random trace of the system we obtain the optimal control action sequence that maximizes the reward sum. The average over many random traces gives an upper bound on the optimal finite-horizon value. We use this upper bound in the action utility measure. The hindsight optimization approach turns out to be effective in some problems (see, e.g., [8] [43]), even though the question of when this approach is useful is still open. However, we note that as long as the ranking of the utility measures of the candidate actions reflects the true ranking well (especially the highest one), these heuristic methods can be expected to work well.
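Hindsight optimization can be sketched in the same style. The sketch below assumes that all randomness of a trace can be sampled up front as a "scenario", after which the system is deterministic, and it finds the best action sequence per scenario by brute-force enumeration, which is only feasible for short horizons. The names `step`, `sample_scenario`, and `hindsight_action` are hypothetical and are introduced only for illustration.

```python
import itertools
import random
from typing import Callable, Dict, Hashable, Sequence, Tuple


def hindsight_action(state, actions: Sequence[Hashable],
                     step: Callable, sample_scenario: Callable,
                     horizon: int, gamma: float, n_traces: int,
                     rng: random.Random) -> Tuple[Hashable, Dict]:
    """Pick the action whose averaged hindsight-optimal value is highest.

    step(state, action, scenario, k) -> (reward, next_state)
        hypothetical deterministic transition once the scenario is fixed.
    sample_scenario(rng) -> one pre-sampled random trace of the system.
    """
    def best_in_hindsight(first_action, scenario):
        best = float("-inf")
        # Enumerate every continuation of the candidate first action.
        for tail in itertools.product(actions, repeat=horizon - 1):
            s, total, discount = state, 0.0, 1.0
            for k, a in enumerate((first_action,) + tail):
                reward, s = step(s, a, scenario, k)
                total += discount * reward
                discount *= gamma
            best = max(best, total)
        return best

    # The same scenarios are reused for every candidate action.
    scenarios = [sample_scenario(rng) for _ in range(n_traces)]
    utilities = {
        a: sum(best_in_hindsight(a, sc) for sc in scenarios) / n_traces
        for a in actions
    }
    return max(utilities, key=utilities.get), utilities
```

Reusing the same scenarios across candidate actions is a design choice that reduces the variance of the comparison; in practice the enumeration would be replaced by a problem-specific deterministic optimizer.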

4  Related  Works 

In this section, we compare our work with several key papers on hierarchical modeling.

We first discuss a key paper by Sutton et al. [36], because this paper cites almost all of the hierarchical MDP works in (at least) the artificial intelligence literature and some in the control literature, and generalizes the previous works in one framework. For many interesting decision problems (e.g., queueing problems), the state spaces in different levels ($X$ and $I$) are non-overlapping. Sutton et al.'s work considers a multi-time MDP model in the dimension of the action space only (action hierarchy) by defining "options" or "temporally extended" actions. That is, the state spaces in different time-scales are the same in Sutton's model. An option $O$ consists of three components: a stochastic terminating function $g$, the set of states in which $O$ can be taken, and a mapping $f$ from state to action. Once an option $O$ is taken from a state, the action from $f$ is taken at each decision time, possibly terminating via the terminating function. In particular, if $g$ specifies the time duration of applying the function $f$, the option $O$ is called a semi-Markov option. That is, a sequence of actions in the action space of the given MDP is a temporally extended action or an option. Note that this option does not determine or change the underlying reward or state transition structure. On the other hand, in our model, the upper level action $\lambda \in \Lambda$ is not a temporally extended action from the action space of the lower level MDPs but is a control in its own right. We can roughly say that the lower level policies defined over different upper level states and controls are semi-Markov options that depend on the upper level state and action.

A similar hierarchical structure in the dimension of the action space only was studied in the Markov slowscale model and the delayed slowscale model by Jacobson et al. [20]. They consider a two-level action hierarchy, where the upper level control is not necessarily an option. However, the upper level control does not change the transition and reward structure of the whole $T$-horizon evolutionary process.




The recent work by Ren and Krogh on multi-mode MDPs [31] studies a nonstationary MDP in which a variable called the system operating mode determines the evolution of the MDP, which operates on a single time-scale. We can view our model as a generalization of multi-mode MDPs by viewing each upper level decision and/or state as a system operating mode.

Even though the situation being considered is totally different, Pan and Basar's work [29] considers a class of differential games that exhibit possible multi-time scale separation. Given a problem defined in terms of a singularly perturbed differential equation, differently time-scaled games are identified and each game is solved independently; from these a composite solution is developed, which is an approximate solution for the original problem. In our model, the upper level MDP solution must depend on the solution for the lower level MDPs.

Finally, as we mentioned before, we can view our model as an MDP-based extension or generalization of Trivedi's hierarchical performability and dependability model. In Trivedi's work, the performance models (fast time-scale models) are solved to obtain performance measures (corresponding roughly to $R^*$ under the lower level policy independent $\delta$-function). These measures are used as reward rates, which are assigned to states of the dependability model (slow time-scale). The dependability model is then solved to obtain performability measures. The lower level is modeled by a continuous-time Markov chain and the upper level is modeled by a Markov reward process (alternatively, a generalized stochastic Petri net can be used). We can see that if we fix the upper level and the lower level decision rules in our model with the lower level policy independent $\delta$-function, an MMDP becomes (roughly) the model described by Trivedi; in our model, the lower level model is also a Markov reward/decision process.

5  Example  Problems 

There are many interesting problems that can be modeled by MMDPs. In this section we illustrate this by considering several representative examples in different contexts. The description of each problem is rather abstract but detailed enough to convey the idea of the MMDP modeling. We first discuss a hierarchical extension of the well-known (controlled) multiarmed bandit problem. The next two examples are queueing control problems that arise in communication networks. We then consider a hierarchical optimal asset allocation problem that arises in a capital market, a hierarchical production planning problem in a manufacturing environment, and a hierarchical employee staffing problem in a service management environment. We then give an example that extends Trivedi's composite performance and dependability model by incorporating controls into the model. As the final example, a stochastic variant of a deterministic combinatorial graph-based optimization problem is discussed.

5.1 Multiarmed bandit problem


It is well known that the multiarmed bandit problem models a class of interesting (stochastic)
optimization problems such as project selection, clinical trials, and random search, to name a
few (see, e.g., [40] and the references therein), and that the problem can be modeled as an MDP.
It is natural to consider a hierarchical version of this problem.

There are Q independent machines. At each time n, we need to select exactly one machine to
operate based on both the upper and lower level states of each machine. The upper level state $i_n$
at time $n$ is $(i_n^1, \ldots, i_n^Q)$ and the lower level state $x_{t_{nT}}$ at time $n$ is
$(x^1_{t_{nT}}, \ldots, x^Q_{t_{nT}})$. Once a particular machine, say q, is selected (upper level
control), only the states of machine q change and all of the other machines freeze. That is,
$i^q_{n+1}$ is determined from a Markovian transition function and, starting with $x^q_{t_{nT}}$,
the lower level MDP evolves according to the lower level transition function. All states of the
other machines remain the same. Note that the lower level transition function depends only on the
upper level state of the activated machine. The T-epoch reward from the lower level accrues to the
upper level as the one-step reward of operating machine q. The problem is to maximize the expected
total discounted sequence of (multi-level) rewards by a pair of decision rules, one for choosing
which machine to operate at each slow time-scale time and one for controlling the fast time-scale
machine dynamics.

This problem has been solved by Gittins, who gave an index rule: under the assumption that the
δ-function is independent of the lower level policies, a straightforward adaptation of the index rule
gives an index $G_q$ for machine $q$ such that, given the lower level state $x^q$ and the upper level
state $i^q$,

$$G_q(x^q, i^q) = \max_{\tau \ge 1} \frac{E^{x^q, i^q}\left[\sum_{n=0}^{\tau-1} \gamma^n R^*_q(x^q_{t_{nT}}, i^q_n)\right]}{E^{x^q, i^q}\left[\sum_{n=0}^{\tau-1} \gamma^n\right]},$$

where the maximization is over the set of all stopping times $\tau \ge 1$ and $R^*_q$ is the optimal
T-horizon value associated with machine q for the lower level MDP. Then the upper level decision rule
that operates the machine with the highest index at each time n achieves the two-level optimal
infinite horizon discounted value.
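
An index of this kind is a ratio of expected discounted reward to expected discounted time up to a stopping time. The following is a minimal sketch of how such indices could be computed for a single machine with finitely many states, using the largest-index-first procedure. The function name, the flat state numbering (each state standing for one pair of lower and upper level states), the reward vector `r` (a stand-in for precomputed optimal T-horizon values), and the discount factor `beta` are assumptions made for illustration and are not taken from the paper.

```python
def gittins_indices(P, r, beta, sweeps=2000):
    """Largest-index-first computation of Gittins indices for one machine.

    P[s][t] : probability of moving from state s to state t when operated
    r[s]    : one-step reward in state s (hypothetical stand-in for R*_q)
    beta    : discount factor in (0, 1)
    """
    n = len(r)
    index = [None] * n
    # The state with the largest reward has an index equal to that reward.
    top = max(range(n), key=lambda s: r[s])
    index[top] = r[top]
    cont = {top}  # continuation set: states whose index is already known
    while len(cont) < n:
        # num[s], den[s]: expected discounted reward and discounted time,
        # starting in s and stopping the first time the chain leaves `cont`
        # (the first step is always taken).
        num = [0.0] * n
        den = [0.0] * n
        for _ in range(sweeps):  # fixed-point iteration, a beta-contraction
            num = [r[s] + beta * sum(P[s][t] * num[t] for t in cont)
                   for s in range(n)]
            den = [1.0 + beta * sum(P[s][t] * den[t] for t in cont)
                   for s in range(n)]
        rest = [s for s in range(n) if s not in cont]
        best = max(rest, key=lambda s: num[s] / den[s])
        index[best] = num[best] / den[best]
        cont.add(best)
    return index
```

With one such index table per machine, the upper level rule described above would operate, at each slow time-scale step, the machine whose current state has the largest index.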

Even though this extension is interesting, as we mentioned above, the upper level decision does
not affect the lower level MDP dynamics; only the upper level state does. Varaiya et al. [40] extended
the standard bandit problem with an additional freedom: when a particular machine is selected to
operate, a control action that affects both the reward and the machine state transition must also be
selected. They call this extended problem the controlled multiarmed bandit problem or “superprocess”.
The superprocess is naturally modeled by an MMDP with the lower level MDP dynamics defined by
$P^l(y \mid x, i, \lambda)$ and $R^l(x, a, i, \lambda)$ as usual, and the upper level by
$P^u(j \mid i, \lambda)$ and $R^*$ with the lower level policy independent δ-function. The optimal
upper level decision rule for this problem is characterized by the notion of a “dominating machine”
and the Gittins index rule (see [40] for details and for applications of the superprocess).

We believe that this example is particularly valuable because it illustrates that there exists a
class of problems that can be modeled as MMDPs for which the optimal upper level decision rule can
be both characterized and computed efficiently.

5.2  Admission  control  with  buffer  management 

Our description of the following example uses a continuous-time domain. By the uniformization
method, the continuous-time model can be cast into a discrete-time model (see, e.g., [39]).
There are two “call-level” traffic classes. For each class, there is a (real number) weight representing
the importance of the class. We can either accept or reject each arriving call. The accepted calls
are aggregated into a single (finite) FIFO buffer. At most $N < \infty$ calls can be in the system.
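
As a minimal sketch of the uniformization step mentioned above, the snippet below converts the generator matrix of a continuous-time Markov chain into a discrete-time transition matrix via P = I + Q / rate. The function name and the list-of-lists matrix layout are assumptions for illustration; the paper itself only refers to [39] for the method.

```python
def uniformize(Q, rate=None):
    """Turn a CTMC generator matrix Q into a DTMC matrix P = I + Q / rate.

    Q[s][t] is the transition rate from s to t (s != t) and
    Q[s][s] = -(sum of the other entries in row s).
    `rate` must be at least the largest exit rate; by default it is
    set to exactly that value.
    """
    n = len(Q)
    max_exit = max(-Q[s][s] for s in range(n))
    if rate is None:
        rate = max_exit
    if rate <= 0 or rate < max_exit:
        raise ValueError("rate must be positive and >= the largest exit rate")
    return [[(1.0 if s == t else 0.0) + Q[s][t] / rate for t in range(n)]
            for s in range(n)]
```

Choosing a larger `rate` only adds self-loops to the discrete-time chain, so any value at or above the largest exit rate gives a valid transition matrix.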

An upper level state $i \in I$ is $(\eta_1, \eta_2)$, where $\eta_c$, $c = 1, 2$, is the number of
pending (currently effective) class-c calls, and an upper level action $\lambda \in \Lambda$ is
$(r_1, r_2)$, where $r_c \in \{0, -1\}$ and $r_c = 0$ (respectively $-1$) means accepting
(respectively rejecting) a newly arrived class-c call. At the slow time-scale, class-c calls arrive
according to a Poisson process with arrival rate $a_c$ and each (accepted) call’s holding time is
exponentially distributed with mean $m_c$, from which we can determine $P^u(j \mid i, \lambda)$ for
$j, i \in I$ and $\lambda \in \Lambda$. We want to maximize the weighted number of accepted calls at
the upper level.

Given  an  upper  level  current  state  i  and  the  current  decision  A  of  rejecting/accepting  the  new 
call(s)  at  the  upper  level  state,  the  number  of  effective  calls  for  the  next  (slow  time-scale)  epoch  will 
be  determined  so  that  the  nature  of  the  packet  arrival  dynamics  would  be  different  over  each  (slow 
time-scale)  epoch.  The  evolution  of  the  upper  level  state  (after  an  appropriate  uniformization)  is 
given  as: 

$$i_{n+1} := [\max\{i_n + (\text{new call arrival(s) at } n),\, N\} + \lambda_n] - \text{call departure(s) at } (n+1)^-, \qquad i_0 = 0,$$

and $[\max\{i_n + (\text{new call arrival(s) at } n),\, N\} + \lambda_n]$ will determine the packet
arrival dynamics over the $[t_{nT}, t_{(n+1)T}]$-period at the fast time-scale. The max operator in
the above equation is implicitly associated with a discard rule, which gives priority to the more
important class when calls must be discarded due to overflow of the buffer. It is assumed that
we have a model for each class that describes packet arrival dynamics at the fast time-scale given
the number of currently pending calls. The model has a finite number of traffic states and, for
each traffic state, an associated probability distribution giving the probability that packets are
generated at rate $b \in \{b_0, b_1, \ldots, b_k\}$, $k < \infty$, in that traffic state given the
number of effective calls. Then the lower level MDP dynamics over each T-epoch are determined through
the structure of the model, which includes an appropriate choice of the δ-initialization function. We
remark that one might want to use a more complex model that describes real traffic more
suitably. However, the above model suffices to illustrate the usefulness of the MMDP model.

The lower level state $x \in X$ is $(x_1, x_2, s_1, s_2)$, where $x_c$ is the number of class-c
packets (of the same size, with the unit time in the fast time-scale) in the buffer, and $s_c$ is the
traffic state of class c, which is assumed observable for simplicity (for the unobservable case,
$s_c$ can be a probability distribution over the traffic states, called the information state, in
which case the evolution of the lower level states is still Markovian). A lower level action
$a \in A$ is $(d_1, d_2)$, where $d_c$ is the number of class-c packets to be dropped from the tail.
The state evolution can be expressed via Lindley’s equation involving a particular action for the
$x_c$ part and the stochastic transition of the $s_c$ part.
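
A minimal sketch of a Lindley-type queue-length recursion with tail dropping is given below for a single class. The function name, the order of events within a fast time-scale step (drop, then serve, then admit arrivals), the default of one packet served per step, and the buffer `capacity` argument are all assumptions for illustration; in particular, how service is shared between the two classes in the common FIFO buffer is not modeled here.

```python
def lindley_step(x, dropped, arrivals, capacity, served=1):
    """One fast time-scale step of a single-class packet buffer.

    x        : packets currently in the buffer
    dropped  : packets dropped from the tail (the control, d_c)
    arrivals : packets arriving during this step
    capacity : maximum number of packets the buffer can hold
    served   : packets served per step (assumed 1 for unit-size packets)
    """
    if not 0 <= dropped <= x:
        raise ValueError("cannot drop more packets than are in the buffer")
    after_drop = x - dropped
    after_service = max(after_drop - served, 0)   # Lindley-type max(., 0)
    return min(after_service + arrivals, capacity)  # overflow is lost
```

In the full lower level model the number of arrivals would be drawn from the distribution attached to the current traffic state, which evolves stochastically alongside the buffer content.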

We wish to determine the values of the $d_c$’s at each (fast time-scale) decision time such that the
weighted average queue size (roughly corresponding to the average waiting time of packets) is minimized
and at the same time the weighted average throughput is maximized, where a tradeoff
parameter balances throughput against queue size/delay. A related
buffer  management  work  on  a  single  class  problem  has  been  reported  in  [9],  in  which  motivations 
for such early packet dropping are given. The lower level reward function combines the weighted
number of accepted calls with the sum of the weighted (finite-horizon average) throughput and the
weighted average queue length, using a tradeoff parameter $\xi$:

$$R^l(x, a, i, \lambda) = \sum_c w_c (x_c - d_c) \cdot (1 - \xi (x_c - d_c)) + \sum_c w_c (r_c + 1),$$

where $w_c$ is the weight of a class-c call/packet.


5.3  Call  routing  with  buffer  management 

The next example problem is a simple call-routing problem with buffer management, motivated
by a “load-balancing” control problem that arises, for example, in traffic engineering in an MPLS
(Multi-Protocol Label Switching) domain [7].

Consider a network with M parallel links (called label switched paths in the MPLS domain),
m = 1, ..., M, between a source and a destination. At the source, single class (voice) calls arrive
according to a Poisson process with a given arrival rate in a slow time-scale. Each call’s holding
time is exponentially distributed with a given mean in the slow time-scale. Again, we can use the
uniformization method for the continuous-time arrival/departure process. The call-level or upper
level decision is either to reject a newly arrived call or, if it is accepted, to decide to which
of the M parallel links to route it. We assume that each link carries some (possibly zero)
cross traffic (video) calls. For simplicity, the video calls are initially set up and do not depart (if we
incorporated the dynamics of the video call arrival and departure process, we would have a three-level
MMDP in which the control process at the highest level assigns video calls among the M parallel links
or rejects them).

It is assumed that all of the voice calls have the same traffic rate (i.e., bandwidth requirement)
and this is also true of the video calls. In other words, the model describing the packet arrival
process (packets being of the same size, with the unit time in the fast time-scale) is the same for
every voice call. For example, if an On/Off model is used, each arriving call has the same On/Off
model parameters. This also holds for the video calls. We may use, e.g., a Markov modulated Poisson
process to model the video packet arrival process [12], and it is also assumed that each video call
shares the same model parameters. We can then induce a model like the one discussed in the previous
example for voice/video traffic that describes packet arrival dynamics at the fast time-scale given
the number of currently pending voice/video calls, with an appropriate choice of the δ-initialization
function at each link.

The  upper  level  state  consists  of  the  number  of  currently  pending  voice  and  video  calls  at  each 
link.  The  lower  level  state  consists  of  the  traffic  states  for  voice  traffic  and  video  traffic  and  the 
number  of  packets  in  the  (finite  FIFO)  buffer  for  voice  and  video  at  each  link.  The  control  action  at 
the  lower  level  is  to  drop  voice  and  video  packets  from  the  tail  as  in  the  previous  example  problem 
at each link. The lower level reward function is given by the sum of the reward functions at the
individual links plus the reward of routing/rejecting a voice call.

We remark that, as a variant of the two examples discussed so far, we can consider a
scheduling problem instead of the buffer management problem.

5.4  Asset  allocation 

It is important that capital market brokers make proper decisions on shifting their investments to
more promising assets depending on market dynamics. As the next example problem for MMDP,
we consider a very simple asset allocation (portfolio management) problem that deals with the
investment of capital in various trading opportunities (e.g., different stocks). The example here is
based on the problem discussed in [28], adapted to our context.

We assume that there is a set of market states ($I$ in the upper level) that triggers the
“government” to change or keep its decision rule ($\Lambda$ in the upper level), which affects the
trend of the exchange rate in the capital market. A transition from one market state to another is
a function of the government decision rule and some exogenous random disturbances. A possible
trend of the exchange rate follows an increasing pattern, but with higher values of the exchange
rate, a drop to very low values becomes more and more probable [28].

A lower level state x consists of the current exchange rate e, the capital (the wealth of the
portfolio) c, which is always calculated in the basis currency, and a variable representing which
currency (e.g., US dollars) is currently used for the investment. The stochastic nature of the
exchange rate determines the lower level MDP dynamics. The action taken by a broker at the lower
level is to choose a proper currency for the investments so as to maximize the expected
return over a given finite horizon (if a lower level policy independent δ-function is used), where
the return is a function of the exchange rate, the capital, and the transaction costs. The optimal
portfolio composition will depend on the government decision rule and the market state, and the
government needs to select a decision rule based on the effects of the brokers’ investments in
the market to maximize the overall return.

5.5  Production  planning

Hierarchical production planning problems have been studied in the operations research literature with
certain  models  and  assumptions  over  many  years  (see,  e.g.,  [33]  for  references).  In  this  subsection, 
we  give  a  simple  production  planning  problem  in  a  manufacturing  environment  as  the  next  example 
of  the  MMDP  model.  We  base  our  discussion  on  the  problem  studied  in  [6]. 

The production planning problem we consider here is divided into two levels: a “marketing
management” level and an “operational” level. At the marketing management level, we need to control
which family to produce over each (slow time-scale) decision epoch, where a family is a set of items
consuming the same amount of resources and sharing the same setup [6]. The upper level state
consists of (stochastically) available resources for each family, (stochastic) setup costs for each
family, and some market-dependent factors. The upper level action is to choose which family to
produce.

At  the  operational  level,  we  need  to  determine  actual  quantities  of  the  items  in  the  family 
(the  lower  level  actions)  given  stochastic  (Markovian)  demands  for  the  items,  production  capacity, 
holding  cost,  material  cost,  etc.,  which  will  constitute  a  state  of  the  lower  level. 

The  return  at  the  operational  level  will  be  a  function  of  the  unit  selling  price  of  the  items, 
the  inventory  holding  costs,  the  setup  costs  of  the  (current)  family,  the  production  quantity  of 
the items, etc., and the T-epoch expected accumulated return at the operational level will be the
one-step  return  for  the  management  level. 
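
To make the role of the T-epoch return concrete, the following minimal sketch computes the optimal expected accumulated return of a small finite lower level MDP over T steps by backward induction; the resulting value is what would be handed to the upper level as its one-step return. The function name and the nested-list layout of `P` and `R` are assumptions for illustration, as are the absence of discounting and the zero terminal reward. In an MMDP, `P` and `R` would be the lower level dynamics and returns induced by the current upper level state and the chosen family.

```python
def t_horizon_value(P, R, T):
    """Optimal expected total return over T steps of a finite MDP.

    P[x][a][y] : probability of moving from state x to y under action a
    R[x][a]    : one-step return of action a in state x
    Returns a list V with V[x] the optimal T-step value from state x
    (no discounting, zero terminal reward: both are assumptions).
    """
    n = len(P)
    V = [0.0] * n  # value with zero steps to go
    for _ in range(T):
        # One backward-induction step: best action given the value-to-go.
        V = [max(R[x][a] + sum(P[x][a][y] * V[y] for y in range(n))
                 for a in range(len(P[x])))
             for x in range(n)]
    return V


# Tiny example: in state 0 one may stay (return 1) or move to state 1
# (return 0); state 1 has a single action that stays and returns 3.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0]]]
R = [[1.0, 0.0], [3.0]]
values = t_horizon_value(P, R, 3)
```

Here `values[x]` plays the part of the operational level's contribution to the management level's one-step return when the lower level starts the epoch in state `x`.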

5.6  Employee  staffing 

The next example in some sense mirrors the manufacturing problem discussed in the previous
section, but it again gives good insight into the usefulness of the MMDP model. It is based on the
work in [44].

Employee staffing at a service organization (e.g., a supermarket or a telephone call center)
is a very important management problem because the approach taken to this problem is directly
related to the revenue of the company.

At level 1, we have the “employee hiring” problem. Suppose the company categorizes the
employees into certain types depending on their skills and service capacities for different purposes.
The state at level 1 is the number of currently hired employees of each type. This number will
change at random on a monthly basis due to, for example, downsizing, company job-training,
employees leaving, etc., and we assume that we have a Markovian model to describe this phenomenon
(see, e.g., [45]). The control at level 1 is to decide how many new employees of each type to hire,
if any.

Given the number of employees currently working at the company, and depending on the
desired activity volume (stochastic customer demands), work duration, (stochastic) company budget,
corporate rules, etc., we need to dispatch the right number of employees of each type to the
right place on a weekly basis. This defines the level-2 MDP problem, often called the “workforce
scheduling” problem.

At  the  lowest  level-3  (a  fast  time-scale),  we  need  to  solve  a  real-time  control  problem.  For 
example,  at  a  telephone  call  center,  we  need  to  assign  incoming  calls,  which  will  be  stochastic,  to 
available  customer  service  representatives. 

5.7  Composite  (controlled)  performance  and  dependability  model 

In this subsection, we extend the example discussed by Goseva-Popstojanova and
Trivedi [15] to show how MMDP can be used to extend their hierarchical model of performance and
dependability (or availability) into a model that incorporates controls. We do not base
our discussion on exactly the same example given in [15]; our example is adapted from it to fit
our context.

There are N multiprocessors, where each processor has a different service characteristic and a
different failure rate. The concept of “dependability” is related to failure, reconfiguration, and
repair of these processors with respect to system behavior; closely related to dependability,
availability measures the potential service capacity that the system can deliver [15]. For our
extension, we naturally introduce a “maintenance” operation in a slow time-scale. We wish to maintain
the system such that better availability/dependability is achieved. The measure of availability will
depend on the performance achieved by each processor in a fast time-scale.

Each processor k’s upper level state has two parts. The first part indicates whether the processor
is up or down. If the first part is up, the second part is processor k’s currently effective
exponential service rate $S$ in a finite set $\{S_{k,1}, \ldots, S_{k,m}\}$, where we assume that
there is a Markov process that describes the transition of service rates for each processor k in the
slow time-scale. If the first part is down, the second part indicates a current “symptom” or reason
of failure from a finite set $\{F_{k,1}, \ldots, F_{k,m}\}$. Processor k is subject to failure
$F_{k,n}$ with an exponential rate $f_{k,n}$. An upper level action is to select one processor to be
repaired among the currently failed processors. The immediate repair cost at the upper level is, for
example, the time to repair associated with symptom F.

At the lower level, jobs or customers of type k (associated with weight $w_k$) arrive at processor
k according to a Poisson process with an arrival rate of $a_k$. Each job requires a processing time
that is exponentially distributed with a given rate, and it is assumed that on a job’s arrival the
processing time is known to the processor. The lower level state is the queue size at each processor
k in terms of the total processing time for the jobs in the queue. The lower level action for each
processor is admission control of newly arriving jobs based on the processor’s current service rate
(and its future stochastic variation if we use the lower level dependent δ-function), so as to
control the throughput and the waiting time of the processed jobs according to a given tradeoff
parameter. The performance achieved by each processor k, weighted by $w_k$, contributes to the upper
level single step reward.

Our overall goal in this example is to maintain the system (at both levels) so as to make the
overall system’s availability optimal over an infinite horizon. In addition to our example here, many
interesting variants of Trivedi et al.’s composite performance and availability model can be
considered via MMDP modeling.

5.8  Salesman’s  travel  planning 

As the final example, we consider a simple but somewhat artificial stochastic variant of a
deterministic graph-based combinatorial optimization problem. Consider a salesman who needs to travel
from a source city to a destination city by a long road trip. We assume that there is a connected
network of possible routes through intermediate cities between the source and the destination.

The  upper  level  problem  is  just  a  variant  of  the  shortest  path  problem.  An  upper  level  state 
is  a  city  and  an  upper  level  action  is  the  choice  of  the  next  city  to  visit.  The  upper  level  state 
transition  will  be  deterministic  in  this  case  and  the  destination  city  will  be  the  termination  state 
in  the  upper  level  MDP  such  that  once  entered,  the  salesman  stays  forever  with  the  zero  cost. 

The salesman needs to visit small towns along the trip between a pair of cities. Between two
towns, he needs to select a method of transportation (taxi, bus, train, walking, etc.) depending
on his current budget, stochastic road conditions due to, for example, the weather, and so forth. It
is assumed that hotel, food, and other prices also change randomly according to a Markovian
model, so he needs to take this random factor into account at his decision time-scale.
The cost is a function of the money he spends and the travel time of the selected
transportation. This determines the lower level MDP dynamics. The budget plan will affect the
decision at the upper level of which city to visit next, and the goal is to minimize the overall
cost of traveling from the source city to the destination city.
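
The upper level of this example can be sketched as a deterministic shortest path computation in which each leg cost stands for the expected cost returned by the lower level problem between two cities. The sketch below is a minimal illustration under that assumption; the function name, the dictionary layout, the city names, and the requirement that leg costs be nonnegative and precomputed are all hypothetical and not taken from the paper.

```python
def plan_route(cost, destination):
    """Value iteration for a deterministic shortest path problem.

    cost[city][next_city] : assumed expected lower level cost of that leg
                            (nonnegative); every city, including the
                            destination, must appear as a key of `cost`.
    Returns (V, policy): minimal cost-to-go and the next city to visit.
    """
    cities = list(cost)
    V = {c: float("inf") for c in cities}
    V[destination] = 0.0  # absorbing termination state with zero cost
    for _ in range(len(cities)):  # enough sweeps for costs to propagate
        for c in cities:
            if c == destination:
                continue
            for nxt, leg in cost[c].items():
                V[c] = min(V[c], leg + V[nxt])
    policy = {c: min(cost[c], key=lambda nxt: cost[c][nxt] + V[nxt])
              for c in cities if c != destination and cost[c]}
    return V, policy


# Hypothetical four-city network.
legs = {"A": {"B": 4.0, "C": 1.0},
        "B": {"D": 1.0},
        "C": {"B": 1.0, "D": 5.0},
        "D": {}}
cost_to_go, next_city = plan_route(legs, "D")
```

In a full MMDP the leg costs would not be fixed numbers but would themselves come from solving the lower level travel problem for each pair of cities.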

We remark that the problem above can be made more interesting if we incorporate a “multi-time
scale” budget condition into the upper level MDP dynamics so that a particular city must
be excluded from the travel plan (a random topological change in the graph). There are many
interesting stochastic variants of deterministic graph-based combinatorial optimization problems
modeled as stochastic shortest path problems, which constitute a subclass of the MDP model [3] [5]
(see, e.g., vehicle routing problems in [32] and references therein). Usually, those graph-based
problems are associated with certain weight functions on the arcs of the graphs. We can consider
lower level MDP problems that replace those weight functions, giving rise to MMDP problems.

6  Concluding  Remarks

We proposed in this paper an analytical model called multi-time scale MDP for a class of
hierarchically structured sequential decision making processes. The model describes interactions between
levels  in  the  hierarchy  in  a  pyramid  sense.  The  upper  level  state  and/or  control  induces  the  lower 
level  MDP  dynamics  over  a  finite  horizon  of  length  corresponding  to  the  relevant  scale  factor.  The 
performance  measure  obtained  from  the  lower  level  will  affect  the  upper  level  decision  making. 

Hierarchical objective functions for this model were devised and the corresponding optimality
equations were established, from which, for each objective function, we derived properties of the
optimal value functions and the existence of optimal policies at all levels. We then presented
exact and approximate solution methods and also discussed heuristic (on-line) methods to solve
MMDPs.

In the evolutionary process of MDPs, the outcome of taking an action at a state is the next
state. Usually, the matter of when this outcome is known to the system is not critical as long as
the system comes to know the next state before the next decision time. However, this might be
an issue in the MMDP model. In our model, we assumed that the next state at the upper level is
known near the boundary of the next time step (refer to Figure 1), which we believe is quite
reasonable. If the effect of taking an action $\lambda \in \Lambda$ at a state $i \in I$, namely the
next state $j \in I$, is immediate, $P^l(y \mid x, i, \lambda)$ may instead be given as
$P^l(y \mid x, j)$. This issue is problem specific and needs to be resolved by the system design.

We  made  the  assumption  that  action  spaces  at  all  levels  in  the  hierarchy  are  distinct.  Even 
though  we  believe  that  this  is  a  natural  assumption,  we  speculate  that  for  some  applications,  some 
actions  might  be  shared  by  different  levels.  Our  assumption  can  be  relaxed  (with  added  complexity 
to  the  model)  so  that  some  actions  are  shared  by  different  levels  as  long  as  any  action  taken  at  a 
state  in  a  level  does  not  affect  the  higher  level  state  transitions.  Developing  a  model  for  the  case 
where a lower level action affects the higher level transitions is still an open problem.

The extension of our model to partially observable MMDPs is straightforward because a
partially observable MDP can be transformed into an MDP with an information state space (see, e.g., [1]).
Furthermore, it would be interesting to extend our model to Markov game settings, yielding
multi-time scale Markov games. The “optimal equilibrium value of the game” over a finite horizon at
the lower level game would be used as the one-step cost/reward for the upper level game. We believe
that the MMDP model we proposed has many applications, in particular in the areas of stochastic
queueing control and production planning.
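
For the partially observable extension, the information state is a probability distribution over the hidden states that is updated by Bayes' rule after each action and observation. The following is a minimal sketch of that standard update; the function name, the nested-list layout of the transition model `P` and the observation model `O`, and the integer action and observation indices are assumptions for illustration and do not come from the paper.

```python
def belief_update(belief, P, O, action, observation):
    """Bayes update of an information state (belief) for a finite POMDP.

    belief[s]     : current probability of hidden state s
    P[s][a][t]    : probability of moving from s to t under action a
    O[t][o]       : probability of observing o when the new state is t
    Returns the new belief after taking `action` and seeing `observation`.
    """
    n = len(belief)
    # Prediction step: push the belief through the transition model.
    predicted = [sum(belief[s] * P[s][action][t] for s in range(n))
                 for t in range(n)]
    # Correction step: weight by the observation likelihood, then normalize.
    unnormalized = [O[t][observation] * predicted[t] for t in range(n)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]
```

Because the updated belief depends only on the previous belief, the action, and the observation, the belief process is itself Markovian, which is what allows it to serve as the state of an ordinary MDP.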